Applied Reconfigurable Computing
13th International Symposium, ARC 2017
Delft, The Netherlands, April 3–7, 2017
Proceedings
Lecture Notes in Computer Science 10216
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, Lancaster, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Zurich, Switzerland
John C. Mitchell
Stanford University, Stanford, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbrücken, Germany
More information about this series at http://www.springer.com/series/7407
Stephan Wong · Antonio Carlos Beck (Eds.)
Applied Reconfigurable
Computing
13th International Symposium, ARC 2017
Delft, The Netherlands, April 3–7, 2017
Proceedings
Editors
Stephan Wong
Delft University of Technology, Delft, The Netherlands
Koen Bertels
Delft University of Technology, Delft, The Netherlands
Antonio Carlos Beck
Federal University of Rio Grande do Sul, Porto Alegre, Brazil
Luigi Carro
Federal University of Rio Grande do Sul, Porto Alegre, Brazil
We would like to acknowledge the support of all the members of this year’s Steering
and Program Committees in reviewing papers, in helping with the paper selection, and
in giving valuable suggestions. Special thanks also to the additional researchers who
contributed to the reviewing process, to all the authors who submitted papers to the
symposium, and to all the symposium attendees.
Last but not least, we are especially indebted to Juergen Becker from the University
of Karlsruhe and to Alfred Hoffmann and Anna Kramer from Springer for their support
and work in publishing this book as part of the LNCS series.
General Chairs
Koen Bertels Delft University of Technology, The Netherlands
Luigi Carro Federal University of Rio Grande do Sul, Brazil
Program Chairs
Stephan Wong Delft University of Technology, The Netherlands
Antonio Carlos Beck Federal University of Rio Grande do Sul, Brazil
Finance Chair
Joost Hoozemans Delft University of Technology, The Netherlands
Proceedings Chair
Hamid Mushtaq Delft University of Technology, The Netherlands
Sponsorship Chair
Pedro Diniz USC Information Sciences Institute, USA
Publicity Chairs
Sorin Cotofana Delft University of Technology, The Netherlands
Pedro Diniz USC Information Sciences Institute, USA
Chao Wang University of Science and Technology of China, China
Web Chair
Johan Peltenburg Delft University of Technology, The Netherlands
Steering Committee
Hideharu Amano Keio University, Japan
Jürgen Becker Universität Karlsruhe (TH), Germany
Mladen Berekovic Braunschweig University of Technology, Germany
Koen Bertels Delft University of Technology, The Netherlands
João M.P. Cardoso University of Porto, Portugal
Katherine Morrow University of Wisconsin-Madison, USA
George Constantinides Imperial College London, UK
Program Committee
Hideharu Amano Keio University, Japan
Zachary Baker Los Alamos National Laboratory, USA
Juergen Becker Karlsruhe Institute of Technology, Germany
Mladen Berekovic TU Braunschweig, Germany
Joao Bispo Universidade Técnica de Lisboa, Portugal
Michaela Blott Xilinx, Ireland
Vanderlei Bonato University of Sao Paulo, Brazil
Christos Bouganis Imperial College London, UK
João Canas Ferreira University of Porto, Portugal
Cyrille Chavet Université de Bretagne-Sud, France
Daniel Chillet Université de Rennes, France
Rene Cumplido Inst. Nacional de Astrofísica, Óptica y Electrónica, Mexico
Florent de Dinechin Université de Lyon, France
Steven Derrien Université de Rennes, France
Antonio Ferrari Universidade de Aveiro, Portugal
Ricardo Ferreira Universidade Federal de Vicosa, Brazil
Roberto Giorgi University of Siena, Italy
Diana Goehringer Ruhr University Bochum, Germany
Marek Gorgon AGH University of Science and Technology, Poland
Frank Hannig Friedrich Alexander University Erlangen-Nürnberg,
Germany
Jim Harkin University of Ulster, UK
Dominic Hillenbrand Karlsruhe Institute of Technology, Germany
Christian Hochberger TU Darmstadt, Germany
Michael Huebner Ruhr University Bochum, Germany
Waqar Hussain Tampere University of Technology, Finland
Fernanda Kastensmidt Federal University of Rio Grande do Sul, Brazil
Krzysztof Kepa GE Global Research, USA
Georgios Keramidas Technological Educational Institute of Western Greece,
Greece
Andreas Koch TU Darmstadt, Germany
Dimitrios Kritharidis Intracom Telecom, Greece
Tomasz Kryjak AGH University of Science and Technology, Poland
Vianney Lapotre Université de Bretagne-Sud, France
Philip Leong The Chinese University of Hong Kong, SAR China
Eduardo Marques University of Sao Paulo, Brazil
Konstantinos Masselos University of the Peloponnese, Greece
Cathal McCabe Xilinx, Ireland
Daniel Mesquita Universidade Federal do Pampa, Brazil
Antonio Miele Politecnico di Milano, Italy
Additional Reviewers
Onur Mutlu
Walid Najjar
Patrick Lysaght
Abstract. In this talk, modern software trends will be explored with a focus on
how we can enable software developers to exploit the benefits of reconfigurable
hardware. This talk introduces PYNQ, a new open-source framework for
designing with Xilinx Zynq devices, a class of All Programmable Systems on
Chip (APSoCs) which integrates multiple processors and Field Programmable
Gate Arrays (FPGAs) into single integrated circuits. The main goal of the
framework is to make it easier for designers of embedded systems to use
APSoCs in their applications. The APSoC is programmed in Python and the
code is developed and tested directly on the embedded system. The pro-
grammable logic circuits are imported as hardware libraries and programmed
through their APIs, in essentially the same way that software libraries are
imported and programmed. The framework combines three main elements:
– The use of a high-level productivity language, Python in this case
– Python-callable hardware libraries based on FPGA overlays
– A web-based architecture incorporating the open-source Jupyter Notebook
infrastructure served from Zynq’s embedded processors
The result is a programming environment that is web-centric so it can be accessed
from any browser on any computing platform or operating system. It enables
software programmers to work at higher levels of design abstraction and to re-use
both software and hardware libraries for reconfigurable computing. The frame-
work is inherently extensible and integrates coherently with hardware-dependent
code written in C and C++. The talk concludes with an outline of areas for
continued development, and a call for community participation.
Contents
Adaptive Architectures
Fault Tolerance
FPGA-Based Designs
Neural Networks
1 Introduction
The ρ-VEX processor [1] is a reconfigurable and extensible softcore very long instruction word (VLIW) processor. It differs from traditional VLIW processors in that its issue-width is parameterized as 2, 4, or 8 (the core contains a maximum of 8 datapaths). A key motivation of the ρ-VEX processor design is to utilize only the necessary resources when they are needed. The dynamic nature of the ρ-VEX processor requires an adaptive cache organization that can combine several caches into a larger one, or split a larger cache into smaller ones, as depicted in Fig. 1; this is commonly referred to as cache resizing.
With these considerations, we investigated the effects of cache resizing triggered by the issue-width mode changes (caused by external factors) of the ρ-VEX processor.
This work is supported in part by the National Natural Science Foundation of China
under grant NSFC-61300011 and NSFC-61300010. The authors would like to thank
the China Scholarship Council (CSC) for their financial support.
Fig. 1. Cache resizing with issue-width mode changes: an 8-issue (8W) VLIW processor with a four-way data cache (D$) can be split into two 4-issue (4W) cores or four 2-issue (2W) cores, each with its own share of the data cache.
2 Related Work
Previous studies on the strategy of “when to resize” almost all relied on the miss ratio or on profiling to determine the right time to resize the cache. The methods in [3,4] made the decision solely by monitoring the cache miss rate; they are miss-driven resizing approaches. In [5], dynamic profiling was used to predict the cache usage and energy efficiency of the application under multiple cache configurations. However, the cache miss rate is not always a good performance indicator [3]. Because many factors affect the miss rate, even minor changes in program behavior or in the available cache size can cause large changes in the miss rate [2,6], so miss-driven resizing can thrash performance. Profiling approaches, like miss-driven approaches, increase the hardware or software overhead. Unlike prior work, and as far as we are aware, our work is the first to use issue-width mode changes of a VLIW processor as external triggers to dynamically reconfigure the cache. In addition, our method reduces the miss rate while downsizing the cache in order to decrease the miss penalty and to smooth the performance.
Finally, our work is implemented on the ρ-VEX VLIW processor [1,7], which is open source and has a complete tool chain (compiler, simulator, and assembler). The issue-width of the ρ-VEX can be reconfigured dynamically to 2-issue, 4-issue, or 8-issue at run-time [8]. In [9], the authors implemented generic binaries, which allow the same binary to execute on processors of different issue-widths without significant hardware modifications. This design allows live data to be maintained within the existing cache blocks when the amount of computing resources in the core is changed. This by itself already results in an improvement of the execution times by on average 16% (with outliers of 0.7% and 42%) for the MiBench benchmark suite (also used in this paper) compared to a case in which each resizing event results in a cold start of the d-cache (and not taking into account i-cache misses).
3 General Approach
3.1 When to Resize
[Figure: the data-cache ways move between three states (enabled, transition, disabled) during downsizing, and the way-selection logic combines the TAG and INDEX fields of the address with issue-width-dependent way masks (a single mask in 8-issue mode, two masks in 4-issue mode, four masks in 2-issue mode) to select the accessible way in the tag and data arrays.]
The R-LRU replacement algorithm moves the accessed block to the enabled part of the cache during the transition period while considering the intrinsic temporal locality of the workloads. There are three cases in the R-LRU replacement algorithm, as Algorithm 1 shows:
– Case 1: hit in the disabled part.
– Case 2: hit in the enabled part.
– Case 3: miss in the whole d-cache.
R-LRU maintains an LRU list L. More precisely, a block at the head of L has been accessed recently, while a block at the rear of L is the least recently accessed one. Let block be the referenced cache block. We introduce three states: state enabled (E) denotes a cache way that a core can access, state disabled (D) denotes a cache way that a core cannot access, and state transition (T) denotes a cache way that is in the transition period when downsizing the cache. The transition state is therefore a transient state used to downsize the d-cache.
Fig. 4. LRU-stack view of the R-LRU replacement algorithm during the transition period: for a hit in the enabled part the stack is updated as in conventional LRU (performance-maintaining cases); for a hit in the disabled part or a miss in the whole cache, the evicted block is the LRU block in most scenarios and a block near the rear of the list (position 1 or 2) in the remaining ones (performance-promoting cases).
To show the benefit of the R-LRU algorithm, we explain its advantages using the LRU-based stack depicted in Fig. 4. There are six different LRU stacks for R-LRU during the transition period. On the left of the figure, the MRU position holds the most recently used block while the LRU position holds the least recently used block. The position next to MRU in recency order is referred to as position 1 and the next one as position 2. The shaded block is in the transition state and switches to the disabled state after the transition period. On the right of the figure, the 18 scenarios of cache accesses are listed. In the following, we discuss the three cases individually:
Case 1 (hit in the disabled part). A hit in the disabled part is treated as a miss in the enabled part. The last enabled block of the LRU list is evicted and we replicate the hit data from the to-be-disabled part to this position. Although the effective capacity of the hit set may decrease, R-LRU only replicates the hit data rather than accessing the next level of the memory hierarchy. Hence, the cost of this case is lower than that of a real miss. Furthermore, there are three scenarios that evict the LRU block and only one scenario that evicts a block located next to the rear of the LRU list, as shown in the second column of the table. An accessed block exhibits temporal locality if it is likely to be accessed again in the near future. In this way, R-LRU increases the amount of most recently used data in the enabled part, which is where the benefit of the R-LRU replacement algorithm comes from.
Case 2 (hit in the enabled part). For a hit in this part, the R-LRU algorithm just needs to update the LRU list, moving the hit block to the head of the list. This is no different from a traditional LRU replacement algorithm. The more hits occur in the enabled part, the better the locality in this part is, and the easier it is to maintain the performance after downsizing. There is no extra overhead for R-LRU in this case. As a result, R-LRU maintains the cache performance, as shown in the first column of the table.
Case 3 (miss in the whole d-cache). On a cache miss, the referenced block is only brought into the enabled part. R-LRU finds the last enabled block in the LRU list and evicts it. In this case, there are three scenarios in which the evicted block is in the LRU position, exactly as in the conventional LRU replacement policy. In the remaining three scenarios, there is one eviction at position 1 and two evictions at position 2. Compared with evicting the LRU block immediately, R-LRU only slightly adjusts the order of the LRU list and brings the new block into an enabled way in advance. In this manner, we benefit from the R-LRU replacement policy if the LRU block is not accessed again before being evicted.
In our framework, during the transition period the R-LRU policy transfers the accessed data from the to-be-disabled ways of the cache to the enabled ways, or boosts the liveness of the data already residing in the enabled ways. Ideally, all the enabled blocks will be included in the first N nodes of the LRU list after the transition period. In other words, all recently used blocks are located at the head of the LRU list. When the switching time t_switch approaches, this optimization minimizes the cache miss penalty introduced by downsizing.
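As an illustration of how the three cases combine, the following is a minimal Python sketch of the R-LRU decision for a single cache set; the class name, the recency-list representation, and the example tags are illustrative assumptions rather than the implementation used in the evaluated simulator.

```python
# Sketch of the R-LRU decision for one cache set during a downsizing transition.
# Way states: 'E' = enabled, 'T' = transition (to be disabled), 'D' = disabled.
# The recency list holds way indices, most recently used first.
# All names and the example tags are illustrative, not the simulator's code.

class RLRUSet:
    def __init__(self, tags, states):
        self.tags = list(tags)                  # tag held by each way
        self.states = list(states)              # per-way state: 'E', 'T' or 'D'
        self.recency = list(range(len(tags)))   # way indices, MRU first

    def _touch(self, way):
        self.recency.remove(way)
        self.recency.insert(0, way)             # move to the MRU position

    def _victim(self):
        # Last enabled way in LRU order: transition/disabled ways are skipped so
        # that new or replicated data always lands in the part that survives.
        for way in reversed(self.recency):
            if self.states[way] == 'E':
                return way
        raise RuntimeError("no enabled way in this set")

    def access(self, tag):
        for way, stored in enumerate(self.tags):
            if stored == tag and self.states[way] == 'E':
                self._touch(way)                # Case 2: hit in the enabled part
                return "hit-enabled"
            if stored == tag and self.states[way] == 'T':
                victim = self._victim()         # Case 1: hit in the to-be-disabled part
                self.tags[victim] = tag         # replicate the block into an enabled way
                self._touch(victim)
                return "hit-disabled"
        victim = self._victim()                 # Case 3: miss in the whole d-cache
        self.tags[victim] = tag                 # fill from the next memory level
        self._touch(victim)
        return "miss"

# Example: a 4-way set in which ways 2 and 3 are being disabled.
s = RLRUSet(tags=[10, 11, 12, 13], states=['E', 'E', 'T', 'T'])
print(s.access(12))   # hit in the transition part -> replicated into an enabled way
print(s.access(99))   # miss -> brought into the enabled part only
```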
5 Evaluation
5.1 Experimental Platform Setup
Our baseline 8-issue core configuration is presented in Table 1. As explained above, the largest ρ-VEX core has a four-way set-associative d-cache in the 8-issue mode. In the 4-issue mode, the cache is divided over the two cores, so each core has half the cache (32 Kbytes, 2-way). Similarly, in the 2-issue mode each core has a 16 Kbyte direct-mapped cache. We chose the MiBench benchmark suite [11]. The benchmarks were compiled with the vex-3.43 compiler (the Hewlett-Packard compiler) using the -O3 optimization level and the -fno-xnop -fexpand-div flags. Our experimental platform comprises the following elements:
– ρ-VEX prototype: We use an FPGA to prototype the ρ-VEX and run applications on actual hardware. The design runs on a Virtex 6 (ML605 development board) at 37.5 MHz. A hardware trace unit collects all the executed instructions for each benchmark on the FPGA prototype of the ρ-VEX.
– Cache simulator: We extracted the memory read and write operations from these traces for use as input to the cache simulator. We extended the DineroIV cache simulator [12], a sequential trace-driven cache simulator, to be able to simulate the reconfigurable cache presented in Sect. 3.
– Core phase predictor [13]: We implemented a simple phase predictor to measure the ILP of the benchmark traces and predict/decide the most suitable mode for the ρ-VEX core to execute in. In addition, this predictor takes into account the trade-offs in terms of delay, energy consumption, and the energy-delay product (EDP) to make the phase predictions (a simplified illustration of such a decision is sketched below).
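The sketch below is only a hedged illustration of an EDP-based mode decision; the per-mode energy figures and the simple delay model are placeholder assumptions and do not reproduce the predictor of [13].

```python
# Illustrative sketch of an EDP-based issue-width mode decision.
# The per-mode energy numbers are placeholders, not measured values.

def choose_mode(window_ilp, modes):
    """Pick the issue-width mode with the lowest energy-delay product.

    window_ilp: average instructions per cycle observed in the trace window.
    modes: dict mapping issue width -> relative energy per cycle.
    """
    best_mode, best_edp = None, float("inf")
    for issue_width, energy_per_cycle in modes.items():
        # A wider core only helps while the window actually exposes that much ILP.
        effective_ipc = min(window_ilp, issue_width)
        delay = 1.0 / effective_ipc             # relative cycles per instruction
        energy = energy_per_cycle * delay       # relative energy per instruction
        edp = energy * delay
        if edp < best_edp:
            best_mode, best_edp = issue_width, edp
    return best_mode

# Placeholder relative energy-per-cycle figures for the 2-, 4- and 8-issue modes.
modes = {2: 1.0, 4: 1.8, 8: 3.2}
print(choose_mode(window_ilp=1.5, modes=modes))  # low ILP -> narrow mode
print(choose_mode(window_ilp=6.0, modes=modes))  # high ILP -> wide mode
```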
5.2 Methodology
6 Results
6.1 The Impact of the Interval of Transition
Figure 5 depicts how the transition interval affects the performance, i.e., when the cache is downsized after a mode switch from 8-issue to 4-issue, from 8-issue to 2-issue, and from 4-issue to 2-issue, respectively. The y-axis of the three graphs represents the reduction in the number of misses in the 2000 cycles (bundles) after the actual downsizing compared to immediate cache downsizing (normalized to immediate cache downsizing at the same execution point). We vary the transition period from 10 cycles to 10000 cycles (x-axis).
In the three downsizing scenarios, our framework shows the same decreasing tendency of cache misses, which clearly demonstrates an advantage over the immediate cache downsizing approach. The longer the transition period is, the larger the reduction in cache misses. When the transition period is set to 2000 cycles, the majority of the benchmarks reach near-optimal performance. More specifically, as shown in Fig. 5, our framework achieves a reduction in misses of on average 13% for 8-issue to 4-issue, 26% for 8-issue to 2-issue, and 16% for 4-issue to 2-issue, respectively. The figure also shows that for switches from 8-issue to 4-issue, the number of cache misses decreases continuously for 16 benchmarks. The same holds for 11 benchmarks when switching from 8-issue to 2-issue and for 12 benchmarks when switching from 4-issue to 2-issue.
Fig. 5. The impact of the transition interval on cache misses with the mode change. The black lines indicate the average for all benchmarks.
Figure 6 depicts the cumulative lasting effect for every mode change of the MiBench benchmarks, given a transition period of 2000 (bundle) cycles. We can observe that the cache miss curve of our approach (normalized to the cache miss rate of immediate cache resizing without our approach) gradually approaches y = 1 (the immediate resizing curve) rather than jumping to it directly. The area between every curve and y = 1 shows the advantage of using our framework.
Fig. 6. The lasting effect with execution time. The black line indicates the average for
all benchmarks.
Across all benchmarks, cache misses decline by, on average, about 16% (2000 cycles after resizing), 14% (4000 cycles after resizing), and 9% (7500 cycles after resizing). Our experiments show that our framework still improves the cache performance more than 6000 (bundle) cycles after cache resizing, given a transition interval of 2000 (bundle) cycles. Such a framework could be particularly useful when downsizing from a four-way set-associative cache to a two-way set-associative cache.
Without the transition period provided by our framework, the downsizing moments result in sharp jumps in cache misses (y = 1 in Fig. 6). The upsizing moments do not result in an immediate recovery of the cache miss rates either, as the newly added cache resources need to be populated again. Using our framework, we smooth the cache miss rate graph, and when an upsizing event happens before the (lasting) effect of our approach has subsided, the cache miss rate can improve again from a much lower point. For example, the recovery can start from any point on the cache miss curve. In this manner, our approach can greatly reduce the cache miss rates in a dynamic environment in which the cache must be resized very quickly. This point is further strengthened by our measured result that a transition period of about 1000–3000 cycles is adequate to obtain the main benefits of our approach.
7 Conclusions
In this paper, we presented a novel reconfigurable d-cache framework combined with an adaptive R-LRU replacement policy that requires no additional hardware overhead. We demonstrated that our framework has the capability to maintain a low miss rate with a transition period of up to 6000 cycles, while a period of 2000 cycles is enough to achieve good results. Moreover, our approach reduces the sharp miss rate increase caused by cache downsizing by on average between 10% and 63%. The short periods in which we achieved our results can lead to computing systems that more frequently perform core resizing (and therefore also cache resizing) in order to maintain a high level of responsiveness without sacrificing too much performance. Finally, when our framework is used in a scenario in which mode changes occur frequently, the improvement of cache performance is further amplified.
Acknowledgement. We would like to thank Prof. Wong for his valuable suggestions
and kind help. We also thank TU Delft for their ρ-VEX platform.
References
1. Anjam, F., Wong, S., et al.: Simultaneous reconfiguration of issue-width and
instruction cache for a VLIW processor. In: Embedded Computer Systems
(SAMOS) (2012)
2. Zang, W., Gordon-Ross, A.: A survey on cache tuning from a power/energy per-
spective. ACM Comput. Surv. 45(3), 32 (2013)
3. Keramidas, G., Datsios, C.: Revisiting cache resizing. Int. J. Parallel Program.
43(1), 59–85 (2015)
4. Yang, S., Powell, M., et al.: Exploiting choice in resizable cache design to optimize
deep-submicron processor energy-delay. In: High Performance Computer Architec-
ture (2002)
5. Mittal, S., Zhang, Z.: EnCache: improving cache energy efficiency using a software-
controlled profiling cache. IEEE EIT (2012)
6. Beckmann, N., Sanchez, D.: Talus: a simple way to remove cliffs in cache perfor-
mance. In: High Performance, Computer Architecture (HPCA) (2015)
7. Wong, S., Van As, T., et al.: ρ-VEX: a reconfigurable and extensible softcore VLIW processor. In: FPT 2008 (2008)
8. Anjam, F., Nadeem, M., et al.: Targeting code diversity with run-time adjustable
issue-slots in a chip multiprocessor. In: Design, Automation and Test in Europe
Conference Exhibition (DATE) (2011)
9. Brandon, A., Wong, S.: Support for dynamic issue width in VLIW processors using
generic binaries. In: Design, Automation Test in Europe Conference Exhibition
(DATE) (2013)
10. Kharbutli, M., Sheikh, R.: LACS: a locality-aware cost-sensitive cache replacement
algorithm. IEEE Trans. Comput. 63, 1975–1987 (2014)
11. Guthaus, M., Ringenberg, J., et al.: MiBench: a free, commercially representative
embedded benchmark suite. In: 2001 IEEE International Workshop on Workload
Characterization, WWC-4, December 2001
12. Hill, M., Edler, J.: DineroIV trace-driven uniprocessor cache simulator (2015)
13. Guo, Q., Sartor, A., et al.: Run-time phase prediction for a reconfigurable VLIW processor. In: Design, Automation Test in Europe Conference Exhibition (DATE) (2016)
LP-P²IP: A Low-Power Version of P²IP Architecture Using Partial Reconfiguration
1 Introduction
The Programmable Pipeline Image Processor (P²IP) is a systolic Coarse-Grained Reconfigurable Architecture (CGRA) for real-time video processing embedded in an FPGA. It features the inherently low-latency structure of a systolic array, a runtime-reconfigurable datapath, high-performance coarse-grained operators, and short compilation times for software applications. Its datapath, operating at the pixel clock frequency, can deliver one processed pixel per clock cycle after the initial latency of a 3-line pipeline [2–4]. The architecture's processing core consists of identical Processing Elements (PEs).
Fig. 1. Processing Element (PE) and its internal blocks. The main blocks are the
Pixel Processor (PP), Memory Controller (MC), Spatial Processor (SP), Reconfigurable
Interconnection (RI) and Configuration Decoder (PE-CD).
The static power is caused by leakage currents inside transistors, while the dynamic power is caused by the switching activity when charging and discharging the load capacitance C, as well as by short-circuit currents when transistors commute.
Dynamic power, as stated in (2), has a linear dependency on the clock frequency f and a quadratic dependency on the supply voltage V. In an FPGA, the load capacitance depends on the number of logic and routing elements used. The factor α is the activity or toggle rate of an element; it depends on the topology and its input stimuli.
P_dynamic = α × C × V² × f (2)
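A short numeric illustration of Eq. (2) follows; the toggle rate, capacitance, voltage, and frequency values are arbitrary placeholders chosen only to show how the terms combine, not measured values for P²IP.

```python
# Worked example of Eq. (2): P_dynamic = alpha * C * V^2 * f.
# All numbers are arbitrary placeholders, not measured FPGA values.

def dynamic_power(alpha, capacitance_f, voltage_v, frequency_hz):
    return alpha * capacitance_f * voltage_v ** 2 * frequency_hz

# Halving the toggle rate (e.g. by removing unused logic) halves dynamic power,
# while halving the supply voltage divides it by four.
p = dynamic_power(alpha=0.15, capacitance_f=2e-9, voltage_v=1.0, frequency_hz=100e6)
print(f"{p * 1e3:.1f} mW")   # 30.0 mW for these placeholder values
```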
3 Modifications on P²IP
VLD (bit 32) | ADDRESS (bits 31–16) | DATA (bits 15–0)
Fig. 2. Configuration word. The Configuration Block reads the word when the VLD
bit is high. ADDRESS corresponds to the operator ID, and DATA is the configuration
info.
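As a hedged sketch, the configuration word of Fig. 2 could be packed and unpacked in software as shown below; the field positions follow the figure, while the function names are illustrative and not part of the P²IP tool flow.

```python
# Packing/unpacking of the configuration word from Fig. 2.
# Assumed layout: VLD in bit 32, ADDRESS (operator ID) in bits 31-16,
# DATA (configuration info) in bits 15-0. Function names are illustrative only.

def pack_config(valid, address, data):
    assert 0 <= address < (1 << 16) and 0 <= data < (1 << 16)
    return ((1 if valid else 0) << 32) | (address << 16) | data

def unpack_config(word):
    valid = (word >> 32) & 0x1
    address = (word >> 16) & 0xFFFF
    data = word & 0xFFFF
    return valid, address, data

word = pack_config(valid=True, address=0x0007, data=0x01A3)
print(hex(word))               # 0x1000701a3
print(unpack_config(word))     # (1, 7, 419)
```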
3.2 PR Applied on P²IP
The runtime flexibility of P²IP requires that the number of PEs and the provided functionality be sufficient to support all possible stream operations. For that reason, the number of PEs is defined before synthesis. However, depending on the video processing algorithm being executed, not all PEs are in use. Considering, for instance, basic algorithms such as Edge Sharpening (Sharp), Canny Edge Detection (Edge), or Harris Corner Detection (Corner), they only share some operations. Sharp uses just three PEs for data processing, while the others use five (Edge) and seven (Corner), respectively. In that context, unused PEs still contribute to both dynamic and static power consumption. For details about mapping each application onto P²IP, refer to [3].
To make the execution of the three aforementioned applications possible, seven PEs are defined. This number is chosen according to the Corner application, which, among the three, requires the largest number of PEs [3]. At runtime, the PEs or their contents cannot simply be removed: this would interrupt the continuity of the video stream. Thus, in addition to the regular content of a PE core (all the blocks shown in Fig. 1), a modified version (core+bypass) is proposed, in which the output buffers the input, to ensure a continuous video stream before removing the core. Indeed, the core of each PE is contained in a Reconfigurable Region (RR), as suggested in [4], so seven RRs are defined in the FPGA area. PEs are equal in size and content; hence, the resource requirements of all RRs are the same. However, the resources allocated to each RR may vary, depending on where the RR is placed in the FPGA area (and, consequently, on the resources available in that area).
To realize the three aforementioned application examples using PR, three configurations are defined:
– Sharp: RR1, RR2, RR3 in default configuration; RR4, RR5, RR6 and RR7
bypassed;
– Edge: RR1, ..., RR5 in default configuration; RR6 and RR7 bypassed;
– Corner : RR1, ..., RR7 in default configuration.
Figures 3, 4 and 5 show, respectively, the Sharp, Canny, and Corner applications mapped onto P²IP using PR. The software-driven configuration mechanism is responsible for activating the inputs, outputs, and internal blocks of each PE.
4 Methodology
The new architecture is able to allocate resources (PEs) to reconfigurable regions
(RRs) defined in the FPGA area. Resources allocated to each RR can be of type
bypass or original PE core.
Fourteen partial bitstreams (Default RR1..7 and Bypass RR1..7, in Fig. 6)
are initially stored in an SD card. During boot, the ARM processor copies these
partial bitstreams to the DDR memory.
After that, the ARM also loads a full bitstream (the initial configuration
containing static and dynamic parts) before the FPGA starts running.
Fig. 3. Sharp application mapped onto P²IP: the first three PEs are in the default configuration; the last four are configured as bypass.
Fig. 4. Canny application mapped onto P²IP: the first five PEs are in the default configuration; the last two are configured as bypass.
By default, the Xilinx Zynq platform offers two options to load a bitstream into the FPGA: the Internal Configuration Access Port (ICAP) or the Processor Configuration Access Port (PCAP). The first one has long been in use by previous FPGA families [1,5,17,18]. It consists of an IP softcore and, consequently, consumes some FPGA resources. The PCAP interface is native, does not consume any FPGA resources, and uses a DMA engine [10]. This process is more efficient than the one adopted by previous Xilinx FPGA families, since those generations did not use DMA natively, which made the partial bitstream transfer slower [9] while forcing the designer to spend more FPGA resources on a custom DMA engine or the ICAP interface [17].
Fig. 5. Corner application mapped onto P²IP: all seven PEs are in the default configuration.
Fig. 6. P²IP using PR: during boot, the ARM reads the partial bitstreams from the SD card and loads them into the DDR. On demand, during runtime, the partial bitstreams are loaded from the DDR into the RRs.
More details about the bitstream copy from the SD card to the DDR memory and from the DDR memory to the PCAP interface (valid for the Xilinx 7-series FPGAs) can be found in [19].
Since the purpose of this work is to reduce energy consumption, additional logic must be minimized; therefore, the PCAP interface was chosen to transfer the (static and partial) bitstreams from memory to the FPGA under ARM supervision. The ARM is also used to activate the inputs/outputs, internal interconnections, and blocks of each PE via an AXI4-Lite [14] interface. For details about how the configuration mechanism works, refer to [3]. Since all the video processing is done on the FPGA side, we have chosen a bare-metal implementation on the ARM side instead of using an operating system. Figure 6 shows the block diagram of the architecture using PR, detailing how the ARM loads a partial bitstream into P²IP.
5 Results
Table 1. Allocated resources, compared to the original implementation (left side) and
measured power, in mW (right side).
Reading the current and voltage with the ARM itself, however, leads to an increase in its power consumption. Using the ARM is a good alternative if the USB Interface Adapter is not available. In this work we have used the USB Interface Adapter. The Fusion Digital Power Designer software, from Texas Instruments, connects to the USB Interface Adapter and obtains voltage and current information, making it possible to calculate the power consumption. It is possible to define the measurement parameters and the acquisition rate. The minimum acquisition rate is 10 ms, but it is important to highlight that the USB Interface Adapter is plugged into a computer running Microsoft Windows, which is not a Real-Time Operating System (RTOS); thus, there is no guarantee that the acquisition rate will be respected. Due to this restriction, the minimum acquisition rate used during the tests was 100 ms. An advantage of this method compared to the ARM reading current and voltage is that the former does not interfere with the ARM power consumption [15].
Fig. 7. FPGA core current measurement circuit on the ZC702 board. The current can be read by the ARM processor or by external hardware from Texas Instruments, both through the I²C bus.
The right side of Table 1 shows the measured power for the three configurations using PR (third column), compared to the original implementation (second column). The last column of the table shows how much power can be saved by applying PR to P²IP. For each configuration, 200 samples were acquired using a sample rate of 100 ms, totaling a 20 s acquisition per application. The values shown in Table 1 are the averages of the 200 samples.
As can be seen in the table, the power overhead for the Corner algorithm, due to the extra partial reconfiguration logic, is negligible.
Table 2 shows the time necessary to change configurations. Loading one partial bitstream takes 2.381 ms for RR1, RR2, or RR3; 2.404 ms for RR4; and 3.175 ms for RR5, RR6, or RR7.
To ensure that the system remains real-time when using PR, the following equation is used:
t_total = t_reconfig + t_config (3)
where t_total is the total reconfiguration latency, t_reconfig is the time necessary to apply PR to the RRs, and t_config is the time necessary to configure the internal PE blocks.
The time necessary to apply PR depends on how many RRs will be reconfigured and is described in (4):
t_reconfig = Σ_i t_RRi (4)
where t_RRi is the time necessary to apply PR to RR i.
The time necessary to apply PR to one RR depends on the access to the external memory (which stores the partial bitstream) and on the time to load the partial bitstream into the respective RR, as shown in (5):
t_RRi = t_DDR + t_loadPB (5)
To maintain the real-time behavior of the system, the following condition must be respected:
t_total < t_frame (6)
If (6) is respected, then only one frame is lost during reconfiguration, whether PR is used or not. If the time overhead introduced by PR is smaller than the frame time, the extra time necessary to apply PR is admissible and does not imply additional delay; that is, the system remains real-time.
It takes t_config = 0.27 µs to configure each operator. In terms of latency, the worst case is changing from Sharp to Corner, in which it is necessary to apply PR to four RRs and configure 21 operators (see Fig. 5). In this case, t_total = 11.935 ms, so only one frame is lost when applying PR. When changing the configuration of the architecture (the number of active PEs or even an internal block), with or without PR, one frame is lost. This work shows that it is possible to apply PR without losing additional frames.
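The numbers above can be checked with a few lines of arithmetic; the sketch below reproduces the worst-case Sharp-to-Corner change using the per-RR load times from Table 2, and compares the total against an assumed 60 frames-per-second frame time (the actual frame rate is not stated in this excerpt).

```python
# Worked check of Eqs. (3)-(6) for the worst case reported in the text:
# switching from Sharp to Corner applies PR to RR4..RR7 and configures
# 21 operators. The 60 fps frame time is an assumed example value.

t_rr = {"RR4": 2.404e-3, "RR5": 3.175e-3, "RR6": 3.175e-3, "RR7": 3.175e-3}  # seconds
t_config_per_operator = 0.27e-6   # 0.27 us per operator
num_operators = 21

t_reconfig = sum(t_rr.values())                   # Eq. (4)
t_config = num_operators * t_config_per_operator
t_total = t_reconfig + t_config                   # Eq. (3)

t_frame = 1.0 / 60                                # assumed 60 fps -> ~16.7 ms
print(f"t_total = {t_total * 1e3:.3f} ms")        # ~11.935 ms
print("real time preserved:", t_total < t_frame)  # Eq. (6): True
```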
6 Conclusions
In this article we have presented a low-power P²IP architecture based on the use of the PR strategy. The original architecture was extended to support PR: processing elements were designated as PR components, resulting in less than 5% resource overhead. To demonstrate the advantages of this novel architecture in terms of power consumption, three image processing algorithms were mapped and executed on both architectures. A power consumption comparison of the original and PR implementations was carried out and showed that the PR implementation leads to power savings of up to 45%. The worst-case scenario, which takes into account the use of all available resources, implies an additional energy cost of less than 1%. Furthermore, the PR latency does not affect the real-time behavior of the system.
The PR strategy should not only be applied to lower the power consumption; it also serves to combine multiple alternative implementations of PEs that can be interchanged according to particular execution and quality requirements. Thus, future work should investigate the balance between power saving and required processing power.
Acknowledgments. The authors would like to thank the Coordination of Superior Level Staff Improvement (CAPES), the Brazilian sponsoring agency, and also the Electronics and Microelectronics Department of the University of Mons, Belgium, for the support offered to the development of this work.
References
1. Cardona, L.A., Ferrer, C.: AC ICAP: a flexible high speed ICAP controller. Int. J.
Reconfigurable Comput. (2015). doi:10.1155/2015/314358
2. Possa, P.R., Mahmoudi, S.A., Harb, N., Valderrama, C., Manneback, P.: A multi-
resolution FPGA-based architecture for real-time edge and corner detection. IEEE
Trans. Comput. 63, 2376–2388 (2014). doi:10.1109/TC.2013.130
3. Possa, P., Harb, N., Dokládalová, E., Valderrama, C.: P2 IP: a novel low-latency
programmable pipeline image processor. Microprocess. Microsyst. 39, 529–540
(2015). doi:10.1016/j.micpro.2015.06.010
4. da Cunha Possa, P.R.: Reconfigurable low-latency architecture for real-time image
and video processing. UMONS (2013)
5. Liu, S., Pittman, R.N., Forin, A.: Minimizing partial reconfiguration overhead
with fully streaming DMA engines and intelligent ICAP controller. In: Microsoft
Research (2009)
6. Liu, S., Pittman, R.N., Forin, A.: Energy reduction with run-time partial recon-
figuration. In: Microsoft Research (2009)
7. Ihsen, A.: Conception de Systèmes Embarqués Fiables et Auto-réglables: Applications sur les Systèmes de Transport Ferroviaire. Université de Valenciennes (2016)
8. Becker, T., Luk, W., Cheung, P.Y.K.: Energy-aware optimization for run-
time reconfiguration. In: 18th IEEE Annual International Symposium on Field-
Programmable Custom Computing Machines (FCCM), pp. 55–62 (2010). doi:10.
1109/FCCM.2010.17
9. Blodget, B., Bobda, C., Huebner, M., Niyonkuru, A.: Partial and dynamically
reconfiguration of Xilinx Virtex-II FPGAs. In: Becker, J., Platzner, M., Vernalde,
S. (eds.) FPL 2004. LNCS, vol. 3203, pp. 801–810. Springer, Heidelberg (2004).
doi:10.1007/978-3-540-30117-2 81
10. Xilinx: Vivado Design Suite Tutorial - Partial Reconfiguration (2015)
11. Xilinx: ZC702 Evaluation Board for the Zynq-7000 XC7Z020 All Programmable
SoC User Guide (2015)
12. Srikanth, E.: Zynq-7000 AP SoC Low Power Techniques part 3 - Measuring ZC702
Power with a Standalone Application Tech Tip (2014)
13. Texas Instruments: USB Interface Adapter Evaluation Module User’s Guide (2006)
14. Xilinx: AXI4-Lite IPIF v3.0 (2016)
15. Nunez-Yanez, J.L., Hosseinabady, M., Beldachi, A.: Energy optimization in com-
mercial FPGAs with voltage, frequency and logic scaling. IEEE Trans. Comput.
65, 1484–1493 (2016). doi:10.1109/TC.2015.2435771
16. Xilinx: Partial Reconfiguration of a Hardware Accelerator on Zynq-7000 All Pro-
grammable SoC Devices (2013)
17. Silva, C.A.A., Neto, A.D.D., Oliveira, J.A.N., Melo, J.D., Barbalho, D.S., Avelino,
A.M.: Definition of an architecture to configure artificial neural networks topologies
using partial reconfiguration in FPGA. IEEE Latin Am. Trans. 15, 2094–2100
(2015)
18. Dondo, J.D., Barba, J., Rincón, F., Moya, F., López, J.C.: Dynamic objects: sup-
porting fast and easy run-time reconfiguration in FPGAs. J. Syst. Archit. 59, 1–15
(2013). doi:10.1016/j.sysarc.2012.09.001
19. Muhammed, A.K., Rudolph, P., Gohringer, D., Hubner, M.: Dynamic and par-
tial reconfiguration of Zynq 7000 under Linux. In: 2013 International Conference
on Reconfigurable Computing and FPGAs, ReConFig 2013 (2013). doi: 10.1109/
ReConFig.2013.6732279
NIM: An HMC-Based Machine
for Neuron Computation
1 Introduction
Neuron simulation has become a popular tool for trying to reproduce the human brain's behavior, and a resource for solving problems that require a learning capability from the system. For a given neuron in a Neural Network (NN), its Natural Time Step (NTS) defines the maximum time it has to read data from its neighbors, operate over the input data, and output the resulting computation to subsequent neurons. Currently, the NTS for an Inferior-Olivary Nucleus (ION) neural arrangement is 50 µs [1]. To keep up with this constraint, today's neural simulators aim to exploit the available application parallelism by using HPC devices, usually composed of a mix of multi-core processors [2], GPU devices [3], and accelerator units based on FPGAs [4]. However, such setups are highly expensive and not energy efficient. A significant part of the system energy consumption comes from data movement throughout the whole system [5]. For a neuron, data from its neighbors travel throughout the entire memory system until they reach the target computational core. Therefore, a neuron simulation system presents a small rate of memory reuse, since only data from a single layer would be reused.
The work presented in [11] provides plain FUs capable of computing data directly from main memory. In our work, more complex FUs have been implemented.
Fig. 1. HMC organization extended with the NIM mechanism: the DRAM banks (B0–B7) of the stacked memory partitions are grouped into 32 vaults, each with a vault controller containing read/write buffers, a DRAM sequencer, and the added NIM logic (register data bank, complex processing units, NIM instruction/status sequencer); the logic layer connects the vaults through a crossbar switch to the four 16-lane links.
This section describes all performed experiments and their results. To interpret the presented results, it is important to note that the total number of neurons simulated in an NN is the product of the number of neurons per layer (N/L) and the total number of layers (L).
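For instance, the network sizes quoted in the results below follow directly from this product; the short sketch merely restates that arithmetic and introduces no new data.

```python
# Total neurons = neurons per layer (N/L) times the number of layers (L).
def total_neurons(neurons_per_layer, layers):
    return neurons_per_layer * layers

print(total_neurons(64, 2048))   # 131072 neurons (baseline maximum at 1 ms)
print(total_neurons(64, 4096))   # 262144 neurons (NIM maximum at 1 ms)
```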
3.1 Methodology
[Figure: execution time of the 8-core baseline versus NIM for networks with 32, 64, 128, and 256 neurons per layer and varying numbers of layers; one panel reports time in µs and the other in ms.]
Izhikevich Application: Figure 2 depicts the results for NNs using the Izhikevich equations. As the number of neurons per layer increases, the number of connections between neurons also increases.
[Figure: execution time of the 8-core baseline versus NIM for 32–256 neurons per layer and varying numbers of layers (panels in µs and ms).]
Figure 3 shows the simulation results for the more relaxed scenario. When the time limit is extended to 1 ms, the NIM mechanism shows the same behavior for configurations with up to 32 neurons per layer. However, when the NN is configured with more than 64 N/L, the number of layers becomes less significant. The baseline can represent a maximum of 131072 neurons (64 N/L, 2048 L), while our NIM mechanism is capable of simulating the same number of neurons in half the baseline time. For the 1 ms budget, NIM simulated up to 262144 neurons in total (64 N/L, 4096 L).
Fig. 6. Energy consumption of NIM relative to the 8-core baseline for the configurations with the maximum number of simulated neurons (32 and 64 neurons per layer) under the 50 µs and 1 ms budgets.
To measure the system energy consumption, we used the McPAT [14] tool, configured for 32 nm technology for both systems. We have chosen to estimate the energy consumption for the HH application since its results showed a more heterogeneous scenario. We compared the baseline and NIM configurations that represented the maximum number of neurons simulated in each case.
Figure 6 depicts the percentage of energy consumed by our system when compared to the baseline. One can notice that the number of neurons per layer impacts the energy reduction our system can provide. For NNs with more neurons per layer, our device mitigates unnecessary data movement from main memory to the caches, since more neurons per layer mean less data reuse. In contrast, increasing the number of layers reduces the impact of NIM on energy consumption, since the number of cache hits increases.
4 Related Work
In this section, we list several works that aim to simulate NNs. Each work tar-
gets distinct neuron models and networks topologies, making it not possible to
compare the presented work directly with others. However, our evaluation metric
(number of neurons in determined simulation time) can be used to approximate
our gains over previous ones. We have classified the presented related works into
four categories: GP-based, GPU-based, FPGA-based, and PIM-based.
In the first class, one could find works as [15] and [2]. Despite the large
processing capability provided by these works, they both suffer from the same
issue: neuron communication. In those cases, it is not possible to simulate NN
within the natural time step.
[3] is an example of GPU-based neuron simulators. However, the timing con-
straint needed to represent biologically accurate NN on a large scale is a challenge
for GPUs. Besides, GPUs are inefficient regarding energy and power.
In the third category, one could fit an extended number of works, as [4,16],
and [17]. Even though using dedicated hardware to simulate large NN is an
effective approach, it is not as flexible as the other ones cited here.
Finally, similarly to our work, [18] aims to accelerate deep learning appli-
cations by exploiting PIM capabilities. In their work, the authors present an
architecture composed of four HMC devices incremented with CPU and GPU
modules at their logic layer. Even though [18] achieved good results, their module
is computationally expensive, and it is not energy efficient as our device.
5 Conclusions
In this paper, we presented Neuron In-Memory (NIM), a computational mechanism able to simulate large Neural Networks (NNs). Our work is based on the vector processing capabilities extracted from NN applications that can be implemented directly in memory, taking advantage of the broad bandwidth available in modern 3D-stacked memory devices. The presented NIM module is capable of simulating NNs of significant size in an embedded environment. When compared with traditional multi-core environments, our mechanism provides system acceleration for large NNs while reducing the overall energy consumption. In future work, we aim to extend our device to enable networks with layers of different sizes, thereby reducing data movement in small NN topologies.
References
1. De Gruijl, J.R., Bazzigaluppi, P., de Jeu, M.T., De Zeeuw, C.I.: Climbing fiber
burst size and olivary sub-threshold oscillations in a network setting. PLoS
Comput. Biol. 8(12), e1002814 (2012)
2. Hines, M., Kumar, S., Schürmann, F.: Comparison of neuronal spike exchange
methods on a Blue Gene/P supercomputer. Front. Comput. Neurosci. 5, 49 (2011)
3. Wang, M., Yan, B., Hu, J., Li, P.: Simulation of large neuronal networks with
biophysically accurate models on graphics processors. In: The 2011 International
Joint Conference on Neural Networks (IJCNN), pp. 3184–3193, July 2011
4. Smaragdos, G., Isaza, S., van Eijk, M.F., Sourdis, I., Strydis, C.: FPGA-based
biophysically-meaningful modeling of olivocerebellar neurons. In: Proceedings of
the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays, FPGA 2014, pp. 89–98. ACM, New York (2014)
5. Zenke, F., Gerstner, W.: Limits to high-speed simulations of spiking neural
networks using general-purpose computers. Front. Neuroinform. 8, 76 (2014).
http://journal.frontiersin.org/article/10.3389/fninf.2014.00076
6. Balasubramonian, R., Chang, J., Manning, T., Moreno, J.H., Murphy, R., Nair,
R., Swanson, S.: Near-data processing: insights from a MICRO-46 workshop. IEEE
Micro 34(4), 36–42 (2014)
7. Hybrid Memory Cube Consortium. Hybrid Memory Cube Specification Rev. 2.0
(2013). http://www.hybridmemorycube.org/
8. Lee, D.U., Hong, S., et al.: 25.2 a 1.2v 8GB 8-channel 128GB/s high-bandwidth
memory (HBM) stacked DRAM with effective microbump I/O test methods using
29nm process and TSV. In: 2014 IEEE International Solid-State Circuits Confer-
ence Digest of Technical Papers (ISSCC), pp. 432–433, February 2014
9. Hodgkin, A.L., Huxley, A.F.: A quantitative description of membrane current and
its application to conduction and excitation in nerve. Bull. Math. Biol. 52(1),
25–71 (1990)
10. Izhikevich, E.M.: Simple model of spiking neurons. Trans. Neur. Netw. 14(6),
1569–1572 (2003)
11. Alves, M.A.Z., Diener, M., Santos, P.C., Carro, L.: Large vector extensions inside
the HMC. In: 2016 Design, Automation Test in Europe Conference Exhibition
(DATE), pp. 1249–1254, March 2016
12. Santos, P.C., Oliveira, G.F., Tome, D.G., Alves, M.A.Z., Almeida, E.C., Carro, L.:
Operand size reconfiguration for big data processing in memory. In: 2017 Design,
Automation Test in Europe Conference Exhibition (DATE), March 2017
13. Alves, M.A.Z., Diener, M., Moreira, F.B., Villavieja, C., Navaux, P.O.A.: Sinuca:
a validated micro-architecture simulator. In: High Performance Computation Con-
ference (2015)
14. Li, S., Ahn, J.H., Strong, R.D., Brockman, J.B., Tullsen, D.M., Jouppi, N.P.: The
McPAT framework for multicore and manycore architectures: simultaneously mod-
eling power, area, and timing. ACM Trans. Archit. Code Optim. (TACO) 10(1),
5 (2013)
15. Sakai, K., Sajda, P., Yen, S.-C., Finkel, L.H.: Coarse-grain parallel computing
for very large scale neural simulations in the NEXUS simulation environment.
Computers in Biology and Medicine, vol. 27(4), 257–266 (1997)
16. Zhang, Y., Mcgeehan, J.P., Regan, E.M., Kelly, S., Nunez-Yanez, J.L.: Biophysically accurate floating point neuroprocessors for reconfigurable logic. IEEE Trans. Comput. 62(3), 599–608 (2013)
17. Beuler, M., Tchaptchet, A., Bonath, W., Postnova, S., Braun, H.A.: Real-time
simulations of synchronization in a conductance-based neuronal network with a
digital FPGA hardware-core. In: Villa, A.E.P., Duch, W., Érdi, P., Masulli, F.,
Palm, G. (eds.) ICANN 2012. LNCS, vol. 7552, pp. 97–104. Springer, Heidelberg
(2012). doi:10.1007/978-3-642-33269-2 13
18. Xu, L., Zhang, D.P., Jayasena, N.: Scaling deep learning on multiple in-memory
processors. In: WoNDP: 3rd Workshop on Near-Data Processing (2015)
VLIW-Based FPGA Computation Fabric
with Streaming Memory Hierarchy
for Medical Imaging Applications
Joost Hoozemans(B) , Rolf Heij, Jeroen van Straten, and Zaid Al-Ars
1 Introduction
In contemporary medical imaging platforms, the complexity of image processing algorithms is steadily increasing (in order to improve the quality of the output while reducing the exposure of patients to radiation). Manufacturers of medical imaging devices are starting to evaluate the possibility of using FPGA acceleration to provide the computational resources needed. FPGAs are known to be able to exploit the large amounts of parallelism that are available in image processing workloads. However, current workflows using High-Level Synthesis (HLS) are problematic for the medical application domain, as they impair programmability (increasing time-to-market) and maintainability. Additionally, some of
the image processing algorithms used are rather complex and can yield varying
quality of results. Therefore, in this paper, we propose a computation fabric on
the FPGA that is optimized for the application domain, in order to provide
acceleration without sacrificing programmability. By analyzing the structure of
the image processing workload type (essentially a pipeline consisting of multiple
filters operating on the input in consecutive steps), we have selected a suitable
processing element and designed a streaming memory structure between the
processors.
The image processing workload targeted in this paper consists of a number
of filters that are applied to the input data in sequence. Each filter is a stage
in the image processing pipeline. The input of a filter stage is the output of the previous stage; the stages stream data to each other. Making sure these transfers are performed as efficiently as possible is crucial to providing high throughput.
The processing element used in this work is based on a VLIW architecture. This type of processor is ubiquitous in areas such as image and signal processing. VLIW processors are known for their ability to exploit Instruction-Level Parallelism (ILP) while reducing circuit complexity (and consequently power consumption) compared to their superscalar counterparts. In the medical imaging domain, power consumption is not a main concern, but as image processing workloads can easily be divided into multiple threads, a reduction in area utilization will likely result in an increase in total throughput.
The remainder of this paper is structured as follows: Sect. 2 discusses related
work, Sect. 3 discusses the implementation details, Sects. 4 and 5 present the
evaluation and results, and Sect. 6 provides conclusions and future work.
2 Related Work
3 Implementation
The computation fabric developed in this work consists of two facets: the processing elements and the memory hierarchy, as shown in Fig. 1. The implementation of both will be discussed in this section. Then, the process of designing a full platform using these components is discussed.
38 J. Hoozemans et al.
Fig. 1. Organization of a single stream of processing elements (Stream unit) and the
streaming connections that link the data memories. Each processor can access the
memory of its predecessor. Each processor’s memories and control registers can be
accessed via a bus that runs on a low clock frequency to prevent it from becoming a
timing-critical net.
This section describes the design and implementation of our fabric. The processor cores in the fabric are derived from the ρ-VEX processor [10]. The ρ-VEX is a VLIW processor based on the VEX ISA introduced by Fisher et al. [11]. The ρ-VEX processor has both run-time and design-time reconfigurable properties, giving it the flexibility to run a broad selection of applications in an efficient way.
Image processing tasks are highly parallelizable in multiple regards: (1) the code is usually computationally dense, resulting in high ILP, and (2) every pixel can in theory be calculated individually, and it is easy to assign pixels to threads (by dividing the image into blocks). In other words, there is an abundance of Thread-Level Parallelism (TLP). Exploiting TLP is usually more area efficient than exploiting ILP: increasing single-thread performance comes at a high price in power and area utilization and quickly shows diminishing returns. This is why GPUs exploit TLP as much as possible by using many small cores. Therefore, the processing elements of our fabric use the same approach, and we use the smallest 2-issue VLIW configuration as a basis. This still allows ILP to be exploited by virtue of having multiple issue slots and a pipelined datapath.
By placing multiple instances of our fabric on an FPGA, TLP can be exploited in two dimensions: by processing multiple blocks, lines, or pixels (depending on the filter) concurrently, and by assigning each step in the image processing pipeline to a dedicated core (pipelining at the task level in contrast to the micro-architectural level).
To explore the design space of the processor’s pipeline organization, we have
measured code size and performance of a 3 × 3 convolution filter implemented in
C. This convolution code forms a basis with which many operators can be applied
to an image depending on the kernel that is used (blurring, edge detection, sharp-
ening) so it is suitable to represent the application domain. The main loop can
be unrolled by the compiler using pragmas. Figure 2 lists the performance using
different levels of loop unrolling for different organizations of a 2-issue ρ-VEX
VLIW-Based FPGA Computation Fabric for Medical Imaging 39
pipeline: the default pipeline with 5 stages and forwarding, one with 2 additional pipeline stages to improve timing, and one using the longer pipeline with forwarding (FW) disabled to further improve timing and decrease FPGA resource utilization. Loop unrolling allows the compiler to fill the pipeline latency with instructions from other iterations. The performance loss introduced by the longer pipeline is reduced from 25% to less than 2% when unrolling 8 times. Additionally, disabling forwarding reduces the resource utilization of a core, allowing more instances to be placed on the FPGA (see Fig. 3).
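For illustration, a minimal C sketch of such a convolution loop is given below; the image dimensions, kernel handling, and the exact unroll pragma syntax are assumptions made for the sake of the example (pragma syntax depends on the compiler), so this is not the exact benchmark code.

#include <stdint.h>

#define W 640  /* image width in pixels (illustrative)  */
#define H 480  /* image height in pixels (illustrative) */

/* 3 x 3 convolution over an 8-bit grayscale image. The pixel loop is a
 * candidate for loop unrolling, which lets the VLIW compiler fill the
 * pipeline latency with independent iterations. */
void convolve3x3(const uint8_t *in, uint8_t *out, const int8_t k[3][3])
{
    for (int y = 1; y < H - 1; y++) {
#pragma unroll 8  /* unroll factor 8, as used in the exploration */
        for (int x = 1; x < W - 1; x++) {
            int acc = 0;
            for (int ky = -1; ky <= 1; ky++)
                for (int kx = -1; kx <= 1; kx++)
                    acc += k[ky + 1][kx + 1] * in[(y + ky) * W + (x + kx)];
            if (acc < 0)   acc = 0;    /* clamp to the 8-bit output range */
            if (acc > 255) acc = 255;
            out[y * W + x] = (uint8_t)acc;
        }
    }
}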
Fig. 2. Execution time (Mcycles) of the 3 × 3 convolution filter for loop unroll factors 0, 2, 4, and 8, for the 5-stage pipeline with forwarding, the 7-stage pipeline with forwarding, and the 7-stage pipeline without forwarding (values range from 126 down to 86 Mcycles).
data memory (making it available for reading by the next core in the stream).
The memory blocks are implemented using dual-port RAM blocks on the FPGA. Each port can sustain a bandwidth of one 32-bit word per cycle, so both processors connected to a block (current, next) can access it without causing a stall. The blocks are connected to the processors by means of a simple
address decoder between the memory unit and the data memories.
The first and last core should be connected to DMA (Direct Memory Access)
units that move data to and from input and output frame buffers (eventually
going off-board).
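To illustrate the streaming organization, the sketch below shows how a core in the middle of a stream could consume a block from its predecessor's data memory and produce results into its own; the base addresses and block size are hypothetical placeholders rather than addresses of the actual platform.

#include <stdint.h>

/* Hypothetical memory map: each core sees its predecessor's data memory
 * and its own local data memory at fixed base addresses. */
#define PREV_DMEM_BASE 0x40000000u   /* predecessor's data memory (read)  */
#define OWN_DMEM_BASE  0x50000000u   /* this core's data memory (write)   */
#define BLOCK_PIXELS   256u          /* pixels handed over per block      */

static uint32_t process_pixel(uint32_t p)
{
    return p;   /* placeholder for one pipeline stage, e.g. gray scaling */
}

void process_block(void)
{
    volatile const uint32_t *src = (volatile const uint32_t *)PREV_DMEM_BASE;
    volatile uint32_t *dst = (volatile uint32_t *)OWN_DMEM_BASE;

    /* Each RAM port delivers one 32-bit word per cycle, so reading the
     * predecessor's memory and writing the local one do not stall. */
    for (uint32_t i = 0; i < BLOCK_PIXELS; i++)
        dst[i] = process_pixel(src[i]);
}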
3.3 Platform
The VHDL code of the components is written in a very generic way and there
are numerous parameters that can be chosen by the designer. First of all, the
ρ-VEX processor can be configured in terms of issue width, pipeline config-
uration, forwarding, traps, trace unit, debug unit, performance counters, and
caches. Secondly, there is an encompassing structure that instantiates processors
in streams. The number of streams and length per stream are VHDL generics.
4 Experimental Setup
Since the target application of the designed system is related to medical image
processing, an X-ray sample image is used as input for the evaluation. Typi-
cal medical imagers work with images that have a size of 1000 by 1000 pixels.
The dimensions of our benchmark images are 2560 by 1920 pixels. The image
is resized to other dimensions in order to determine the scalability of system
performance. Each pixel is represented by a 32-bit value (RGBA). Using a tech-
nique described in the following section, the image may be scaled down to 1280
by 960 and 640 by 480 pixels.
A workload of algorithms based on a typical medical image processing
pipeline is used. The first step in the image processing pipeline is an interpolation
algorithm used to scale the size of the source image. The bi-linear and nearest
neighbor interpolation algorithms both have the same computational complexity
making them equally feasible. Because of its slightly higher flexibility, we select
the bi-linear interpolation algorithm for the evaluation. Secondly, a gray scaling
algorithm is applied. This algorithm is selected because it operates on single
pixels in the input dataset. The third stage is a convolution filter that sharpens
the image, followed by the final stage, an embossing convolution filter.
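As an example of the per-pixel nature of the gray-scaling stage, a minimal sketch follows; the RGBA channel order and the integer luma weights are common conventions assumed here, not necessarily the exact ones used in the benchmark.

#include <stdint.h>

/* Convert one 32-bit RGBA pixel to a gray value replicated across R, G, B.
 * The 77/150/29 weights approximate the common luma formula
 * 0.299 R + 0.587 G + 0.114 B in fixed point (the weights sum to 256). */
static uint32_t to_gray(uint32_t rgba)
{
    uint32_t r = (rgba >> 24) & 0xFF;
    uint32_t g = (rgba >> 16) & 0xFF;
    uint32_t b = (rgba >>  8) & 0xFF;
    uint32_t a =  rgba        & 0xFF;
    uint32_t y = (77u * r + 150u * g + 29u * b) >> 8;
    return (y << 24) | (y << 16) | (y << 8) | a;
}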
5 Evaluation Results
5.1 Resource Utilization
We have synthesized the platform using various configurations targeting the
Xilinx VC707 evaluation board. As stated, the pipeline organization of the
processing elements influences the resource utilization and timing. In Figs. 3 and 4, the options have been evaluated using the standard synthesis flow
(unconstrained). With forwarding enabled, the platform completely fills the
FPGA using 64 cores. When forwarding is disabled, this can be increased to
75.
Additionally, we have performed a number of runs where we created sim-
ple placement constraints that steered the tool towards clustering the cores per
stream so that they are aligned on the FPGA in accordance with their stream-
ing organization. A single stream consisting of 4 cores achieves an operating
frequency of 200 MHz. Using 16 streams, timing becomes somewhat more diffi-
cult as the FPGA fabric is not homogeneous (some cores will need to traverse
sections of the chip that are reserved for clocking, reconfiguration and I/O logic,
and the distribution of RAM Blocks is not completely uniform). Still, this con-
figuration achieves an operating frequency of 193 MHz at 80% LUT utilization,
leaving room for interfacing with off-board electronics.
[Fig. 4 data: execution times between 13.1 µs and 7.12 µs for the five evaluated platform configurations.]
Fig. 4. Execution times of a convolution 3 × 3 filter for the platforms in the design-
space exploration as listed in Fig. 3 using 8x loop unrolling (from Fig. 2).
6 Conclusion
References
1. Hoozemans, J., Wong, S., Al-Ars, Z.: Using VLIW softcore processors for image
processing applications. In: 2015 International Conference on Embedded Computer
Systems: Architectures, Modeling, and Simulation (SAMOS), pp. 315–318. IEEE
(2015)
2. Stevens, D., Chouliaras, V., Azorin-Peris, V., Zheng, J., Echiadis, A., Hu, S.:
BioThreads: a novel VLIW-based chip multiprocessor for accelerating biomedical
image processing applications. IEEE Trans. Biomed. Circuits Syst. 6(3), 257–268
(2012)
3. Nowatzki, T., Gangadhan, V., Sankaralingam, K., Wright, G.: Pushing the lim-
its of accelerator efficiency while retaining programmability. In: 2016 IEEE Inter-
national Symposium on High Performance Computer Architecture (HPCA), pp.
27–39. IEEE (2016)
4. Putnam, A., Caulfield, A.M., Chung, E.S., Chiou, D., Constantinides, K.,
Demme, J., Esmaeilzadeh, H., Fowers, J., Gopal, G.P., Gray, J., et al.: A recon-
figurable fabric for accelerating large-scale datacenter services. IEEE Micro 35(3),
10–22 (2015)
5. Ovtcharov, K., Ruwase, O., Kim, J.-Y., Fowers, J., Strauss, K., Chung, E.S.: Accel-
erating deep convolutional neural networks using specialized hardware. Microsoft
Research Whitepaper, vol. 2 (2015)
6. Russo, L.M., Pedrino, E.C., Kato, E., Roda, V.O.: Image convolution processing:
a GPU versus FPGA comparison. In: 2012 VIII Southern Conference on Program-
mable Logic, pp. 1–6, March 2012
7. Wang, P., McAllister, J., Wu, Y.: Soft-core stream processing on FPGA: an FFT
case study. In: 2013 IEEE International Conference on Acoustics, Speech and Signal
Processing, pp. 2756–2760, May 2013
8. Wang, P., McAllister, J.: Streaming elements for FPGA signal and image process-
ing accelerators. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 24, 2262–2274
(2016)
9. Bardak, B., Siddiqui, F.M., Kelly, C., Woods, R.: Dataflow toolset for soft-core
processors on FPGA for image processing applications. In: 2014 48th Asilomar
Conference on Signals, Systems and Computers, pp. 1445–1449, November 2014
10. Wong, S., Anjam, F.: The Delft reconfigurable VLIW processor. In: Proceedings
of 17th International Conference on Advanced Computing and Communications,
(Bangalore, India), pp. 244–251, December 2009
11. Fisher, J.A., Faraboschi, P., Young, C.: Embedded Computing: A VLIW Approach to Architecture, Compilers, and Tools. Morgan Kaufmann Publishers, San Francisco (2005)
Embedded Computing and Security
Hardware Sandboxing: A Novel Defense
Paradigm Against Hardware Trojans
in Systems on Chip
1 Introduction
To tackle system complexity and reduce costs and time-to-market in system-on-chip (SoC) design, third-party Intellectual Property (IP) cores are used as integral parts of SoC designs. Major parts of IP design and IC production are outsourced to non-trusted facilities distributed across the globe, thus opening the door for Trojan insertion. Hardware Trojan insertion into an IC can occur at any stage of the third-party IP (3PIP) integration process [5,16], including the specification, design, verification, and manufacturing stages. Approaches to Trojan mitigation in SoCs have so far been static, using intense simulation, verification, and physical tests to detect the presence of malicious components before
system deployment. While static methods take place at all levels of the integration process, post-fabrication testing based on side-channel observation has so far received more attention in the research community. The number of test patterns needed to activate with certainty potential hidden Trojans is very large for complex IPs and SoCs with dozens of inputs, outputs, states, and memory blocks, thus limiting the effectiveness of static testing methods. Run-time approaches such as [13], which have been proposed to monitor signal behavior and detect potential Trojans, rely solely on checkers and do not address generalization.
In this work, we propose a novel approach, Hardware Sandboxing, for Trojan mitigation in SoCs. Our approach is based on the well-known concept of the sandbox already in use in software, whose goal is to contain the execution of non-trusted pieces of code, together with the resources they need, in isolated environments, and to deploy guards that prevent damaging actions to the rest of the system. Isolation of malicious IPs can increase system security while reducing fabrication costs and pre-deployment verification and testing efforts. Our concept
will be enforced by dividing the system into a trusted area and a non-trusted
area. Components in the trusted area are designed under strict control of the
system integrator (e.g. the military) and trustworthy partners. These compo-
nents are assumed to be safe and can access any system resource. Components
in the non-trusted area are designed by non-trusted sources, and because they
may contain hardware Trojans, they must be placed in a sandbox along with
virtual resources they need. Trojans can be hidden in IPs and ICs, but as long as
they do not manifest, the system can be considered secure. The rationale of our work is therefore the same as that of fault-tolerant systems, namely to design and build systems along with dynamic methods that are capable of detecting manifestations of Trojans at run-time and preventing potential damage to the system. To the best
of our knowledge, this is the first work that addresses security in systems-on-chip
through a containment of potential malicious components into sandboxes, which
includes resources needed by the components in virtualized form, along with
rule enforcement modules to detect malicious activities at run-time and prevent
damage to the system.
The rest of the paper is organized as follows. Section 2 presents a short review
of existing hardware Trojan mitigation methods. In Sect. 3, we present a general
organization of SoC devices for a secured integration of non-trusted components.
Using software as reference, sandboxing concepts and their feasibility in SoCs is
investigated in Sect. 3.1. We then devise the structure of a Hardware Sandbox
in Sect. 3.2, which leads to a design flow that starts with a systematic character-
ization of security properties and automatic generation of Hardware Sandboxes
in Sect. 4. Our method is validated in Sect. 5 with examples from the trust-
hub (www.trust-hub.com) benchmark leading to 100% protection of the system.
Section 6 concludes the work and provides some indications of our future work.
2 Related Work
A comprehensive state of the art review of hardware Trojan mitigation
approaches is provided in [5]. Accordingly, protection and countermeasures can
be done at three levels: at design time, at test time before deployment, and dur-
ing system operation. Design time approaches mitigate Trojans either by hiding
functional or structural properties of IPs to potential Trojan attackers through
modification of IPs and ICs operation [7], thus making it difficult to insert Tro-
jans in IPs, or by filling all non-used resources with non-functional components to
prevent their use for Trojan insertion [19]. Side channel analysis has been widely
investigated. It assumes that additional circuits needed for Trojan implementa-
tion and monitoring of activation conditions will have an observable impact on
physical properties of the IC such as power behavior [18], area [3], temperature
profile [8], path delays [12], or a combination of many physical parameters [6].
Deviation from behavior of a golden and Trojan free model is interpreted as
a Trojan activity. Increasingly, verification approaches are being used to ensure
correctness of some trust properties [15,20]. The idea is to characterize proper IP
behavior and exercise functional verification with high coverage factor to catch
deviations from normal IP behavior. One main problem with static approaches
is the need for a golden model. Hardware Trojans are inserted most of the time
because companies buy COTS and IPs that they cannot design in-house, and thus no golden model exists. Even in the presence of extensive tests
and functional verification, activating test patterns may still not be exercised at
testing time.
In this work, we are more interested in run-time approaches that can dynam-
ically understand and assert IPs’ properties to identify Trojans and prevent
potential damage to critical systems. Online methods that rely on side chan-
nel analysis have the advantage of monitoring all devices’ behavior at run-time
and are therefore able to catch Trojans as they unfold. However, they still need
a physical profile that only a golden model can provide. Security Monitoring
has been discussed in [4] as a means to check signal behavior at run-time and
identify deviations that might be attributed to malicious activities. The idea is to use assertions as a means to describe signal behavior, along with reconfiguration for reuse of the area needed for the checker. Unfortunately, further details were not provided on the conceptual and implementation realization of such a strategy. A checker based on the use of parity information for online verification of potential security deviations has been presented in [13]. The checker is a classic parity checker, protected by a randomization procedure to prevent attacks from potential Trojans. Even though the authors achieved a 99.98% success rate, no systematic approach has been provided for the design of generalized checkers. In [11], an isolation mechanism was presented with the goal of monitoring and analyzing traffic flow between an embedded device and the network for the detection of potential DDoS activities. As in the previous case, the approach does not involve virtual resources, which are a main component of the hardware sandboxes considered here.
Fig. 1. Non-trusted IP integration in secured SoC using hardware sandboxing.
The SoC is divided into a trusted region with direct access to system resources including communication components, peripherals and memory, and one or more non-trusted regions in which components execute in a sandbox. The trusted zone is tightly controlled by the system integrator and all resources are developed only by trusted contractors. Components Off The Shelf (COTS) and
IPs designed by non-trusted contractors are only given indirect access to system resources through the sandbox. The proposed approach can be realized at all levels
of the chip design cycle.
At system specification and register transfer level (RTL) implementation lev-
els, the integrator will design the sandbox along with all resources under tight
control and provide an interface to IP designers to integrate their IPs in the SoC.
At the manufacturing level, a split-manufacturing process [13] can then be used to manufacture the trusted areas and sandboxes on the one hand and the non-trusted parts on the other in separate facilities. The system-on-chip of Fig. 1 features a processor, memory, peripherals, and two hardware accelerators in the trusted area. There are four non-trusted IPs encapsulated in three sandboxes: two of the IPs each use a sandbox exclusively, while the other two share one sandbox as a result of resource optimization.
Feasibility. The use of sandboxes between IPs and the rest of the system comes
at the cost of performance and resource overhead. However, this is not an issue
in today’s SoCs as the evaluation in Sect. 5 will prove. Despite the high speed of
Like many other technologies, such as network on chip, that originated from soft-
ware before finding their way into hardware, we will first look into the details
of sandboxing in software and devise a structure that fulfills hardware require-
ments. We rely on the taxonomy provided in [17], which places sandboxes in one
of the following categories, depending on their operation mode.
In-line Reference Monitor. This approach inserts resource access policies in the
code of the non-trusted IP, which guarantees the enforcement of security policies
even in case of bugs. Many verification tools allow for the insertion of assertions in
IP specification for the purpose of verification only. While synthesizable assertion
components are provided in libraries like OVL, they target a more coarse-grained
integration at the interface of components. Extension of in-line reference monitor
to the interface of IPs is more attractive for non-trusted IPs, many of which are
COTS where the integrator has no access to the internals and therefore limits
the interaction to the interface.
System Call Sandbox. Here, applications within the sandbox access system
resources using system calls, which are caught and executed by the VM or the
sandbox manager. This approach is similar to the previous in-line reference mon-
itor, with the only difference being that the emulation takes place at the interface
of the application and not within the code lines. This approach can be used to
contain the execution of subsystems with a processor and code that access system resources.
With the previous discussion, we are now in position to devise the structure of
our hardware sandbox (Fig. 2). The goal is to provide an environment with tools
and capabilities for one or more non-trusted IPs to execute without jeopardizing
secure parts of the system. We therefore propose the following components for
a hardware sandbox.
Checkers. One or more checkers used for run-time enforcement of security rules
defined by the system integrator at compile time. A checker is devised from the
properties of an IP component in the sandbox and can be limited to only a
subset of IP signals and properties for overhead reduction.
Virtual Resources. The concept of the sandbox requires that resources needed
by IPs are provided in virtual form within the sandbox, where they can be used
by an IP without damaging the rest of the system. In the sandbox of Fig. 2, the
virtual UART (V-UART), virtual USB (V-USB), and virtual VGA (V-VGA)
along with virtual memory V-MEM are provided to the IP in the sandbox. The
main advantage here is that the interface between virtual resources within the
sandbox and physical resources follows a secured protocol and can never cause
a denial-of-service. Any attempt from a Trojan to alter a peripheral will be
nullified by the virtual peripheral.
4 Design Flow
Fig. 3. Hardware sandbox design flow.
The structure of the hardware sandbox devised in the previous section gives us a design flow consisting of (1) selecting the virtual resources to be used along with their connection to IPs in the sandbox, (2) generating checkers for the IPs' signals and behavior to observe and the rules to enforce over time, and (3) designing the sandbox controller to map virtual resources to physical ones and to control the flow of data to and from the sandbox. While it is possible to perform all those tasks manually on small examples, the complexity of today's designs requires tools that can automate the whole design process and produce efficient and secure systems. The design flow we propose is illustrated in Fig. 3 and starts with the specification of the IPs in the sandbox, the security properties and rules to be enforced at run-time on selected signals, and the resources to be virtualized in the sandbox. The flow produces the sandbox.
Controller. The controller can be written by the user or generated from a behav-
ioral description of the components in the sandbox. It must include actions to perform in case of a security rule violation, report IP activities to the embedded processor, and arbitrate data exchange between virtual resources and the corresponding physical resources. The controller can vary from a simple finite state machine
to a small processor that runs complex code in the sandbox.
This pattern follows the RS232 protocol used in the UART transmitter. Initially, xmitH is asserted to begin the transmission. For the next 16 clock cycles, uart_tx is always unasserted. Following that, the actual data is serialized and transmitted, starting from the LSB of xmit_data. Each of the transmitted bits occupies 16 clock cycles. Once that is complete, uart_tx is high for the next 31 clock cycles, followed by the uart_xmit_doneH signal being asserted to signal that the transmission process is complete. By checking that each of these signals behaves as the protocol prescribes, we can build a test expression for the OVL Cycle Sequence checker. Additional checkers, such as one for the UART receiver, are created in the same manner.
Using the fire signal generated by the OVL component, we attach a series of memory-mapped status and configuration registers to our sandbox to allow the processor to read whether a non-trusted IP is misbehaving, i.e., whether our register reads 1 instead of 0. With this knowledge, appropriate action can then be taken by the user.
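A minimal sketch of how software on the embedded processor could poll such a status register is shown below; the register address and bit layout are hypothetical placeholders, since they are not specified here.

#include <stdint.h>

/* Hypothetical address of the sandbox status register; a real system
 * would take this from the platform memory map. */
#define SANDBOX_STATUS_REG 0x43C00000u

/* Returns non-zero if the sandbox checker has flagged a rule violation. */
static int sandbox_ip_misbehaving(void)
{
    volatile uint32_t *status = (volatile uint32_t *)SANDBOX_STATUS_REG;
    return (*status & 0x1u) != 0;   /* assume bit 0 mirrors the OVL fire signal */
}

void poll_sandbox(void)
{
    if (sandbox_ip_misbehaving()) {
        /* Possible reactions: disable the IP, reset the sandbox, or
         * report the event to a higher-level security monitor. */
    }
}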
Table 1. Evaluation of our hardware checkers with various RS232 designs from the
Trust-Hub (www.trust-hub.com) benchmark.
6 Conclusion
Acknowledgment. This work was in part supported by the Air Force Summer Fac-
ulty Fellowship Program (SFFP 2015) at the Air Force Research Lab, Cyber Assurance
Branch in Rome, NY. The authors would like to thank the Air Force and Information
Institute for all the support they provided during the summer 2015.
References
1. IEEE standard for property specification language (PSL). IEEE Std 1850–2010
(Revision of IEEE Std 1850–2005) pp. 1–182, April 2010
2. ARM: Trustzone. http://www.arm.com/products/processors/technologies/
trustzone/
3. Banga, M., Hsiao, M.: A region based approach for the identification of hardware
Trojans. In: IEEE International Workshop on Hardware-Oriented Security and
Trust, HOST 2008, pp. 40–47, June 2008
4. Bhunia, S., Abramovici, M., Agrawal, D., Bradley, P., Hsiao, M., Plusquellic,
J., Tehranipoor, M.: Protection against hardware trojan attacks: towards a com-
prehensive solution. IEEE Des. Test 30(3), 6–17 (2013)
5. Bhunia, S., Hsiao, M., Banga, M., Narasimhan, S.: Hardware trojan attacks: threat
analysis and countermeasures. Proc. IEEE 102(8), 1229–1247 (2014)
6. Çakir, B., Malik, S.: Hardware Trojan detection for gate-level ICs using signal cor-
relation based clustering. In: Proceedings of the 2015 Design, Automation & Test
in Europe Conference & Exhibition, DATE 2015, EDA Consortium, San Jose, CA,
USA, pp. 471–476 (2015). http://dl.acm.org/citation.cfm?id=2755753.2755860
7. Chakraborty, R.S., Bhunia, S.: Security against hardware Trojan attacks
using key-based design obfuscation. J. Electron. Test. 27(6), 767–785 (2011).
http://dx.doi.org/10.1007/s10836-011-5255-2
8. Forte, D., Bao, C., Srivastava, A.: Temperature tracking: an innovative run-time
approach for hardware Trojan detection. In: 2013 IEEE/ACM International Con-
ference on Computer-Aided Design (ICCAD), pp. 532–539, November 2013
9. Glazberg, Z., Moulin, M., Orni, A., Ruah, S., Zarpas, E.: PSL: beyond hardware
verification. In: Ramesh, S., Sampath, P. (eds.) Next Generation Design and Ver-
ification Methodologies for Distributed Embedded Control Systems, pp. 245–260.
Springer, Netherlands (2007). doi:10.1007/978-1-4020-6254-4 19
10. Group, O.W.: Open verification library (OVL) working group. http://accellera.org/
activities/working-groups/ovl
11. Hategekimana, F., Tbatou, A., Bobda, C., Kamhoua, C.A., Kwiat, K.A.: Hard-
ware isolation technique for IRC-based botnets detection. In: International Confer-
ence on ReConFigurable Computing and FPGAs, ReConFig 2015, Riviera Maya,
Mexico, 7–9 December 2015, pp. 1–6 (2015). http://dx.doi.org/10.1109/ReConFig.
2015.7393319
12. Lamech, C., Rad, R., Tehranipoor, M., Plusquellic, J.: An experimental analysis of
power and delay signal-to-noise requirements for detecting Trojans and methods
for achieving the required detection sensitivities. IEEE Trans. Inf. Forensics Secur.
6(3), 1170–1179 (2011)
13. Mitra, S., Wong, H.S.P., Wong, S.: Stopping hardware Trojans in their tracks. A
few adjustments could protect chips against malicious circuitry. http://spectrum.
ieee.org/semiconductors/design/stopping-hardware-trojans-in-their-tracks
14. Pnueli, A.: The temporal semantics of concurrent programs. Theoret. Comput. Sci. 13(1), 45–60 (1981). http://www.sciencedirect.com/science/article/pii/0304397581901109
15. Sengupta, A., Bhadauria, S.: Untrusted third party digital IP cores: power-delay
trade-off driven exploration of hardware Trojan secured datapath during high
level synthesis. In: Proceedings of the 25th Edition on Great Lakes Symposium
on VLSI, GLSVLSI 2015, NY, USA, pp. 167–172 (2015). http://doi.acm.org/10.
1145/2742060.2742061
16. Tehranipoor, M., Koushanfar, F.: A survey of hardware Trojan taxonomy and
detection. IEEE Des. Test Comput. 27(1), 10–25 (2010)
17. Venema, W.: Isolation mechanisms for commodity applications and platforms.
Technical report RC24725 (W0901–048), IBM, January 2009
18. Wei, S., Potkonjak, M.: Scalable hardware Trojan diagnosis. IEEE Trans. Very
Large Scale Integr. (VLSI) Syst. 20(6), 1049–1057 (2012)
19. Xiao, K., Tehranipoor, M.: BISA: built-in self-authentication for preventing hard-
ware trojan insertion. In: 2013 IEEE International Symposium on Hardware-
Oriented Security and Trust (HOST), pp. 45–50, June 2013
20. Zhang, X., Tehranipoor, M.: Case study: detecting hardware Trojans in third-party
digital IP cores. In: 2011 IEEE International Symposium on Hardware-Oriented
Security and Trust (HOST), pp. 67–70, June 2011
Rapid Development of Gzip with MaxJ
1 Introduction
Gzip is a popular utility and widely used file format for lossless data compression.
In this paper, we compare different implementations of the gzip compression on
FPGAs using various languages. All implementations use very similar system
architectures and are inspired by previous work by IBM [1].
This study provides an opportunity to show how choices regarding the programming language offer distinct trade-offs in productivity, performance, and area utilization. This is of special interest, since FPGAs provide many possibilities to
accelerate tasks while reducing energy consumption at the same time.
Designer productivity, and thereby development time, is a major cost factor
in system design. While we acknowledge the challenges with accurately measur-
ing productivity, especially in a comparable and quantified way, we still make some claims about productivity advantages in the context of gzip development.
In recent years, different high-level synthesis tools have emerged to overcome the high complexity of hardware description languages such as VHDL and Verilog, especially when targeting FPGAs. One of these tools, provided by Altera,
is based on the OpenCL standard [2]. The programmer writes C-like code with
additional OpenCL features to guide Altera’s SDK in creating FPGA bitstreams.
A different approach is offered by new hardware description languages, which maintain the concepts known from high-level programming languages and thereby
preserve their comfort while targeting hardware. One example is MaxJ by Max-
eler and OpenSPL [3]. MaxJ is a Java based language with additional features
and libraries to enable the rapid creation of FPGA designs.
To emphasize the OpenCL advantages, Altera published the results of their
gzip implementation [4] and compared them to results published by IBM. In
this paper, an implementation of the same algorithm in MaxJ is presented and
compared to related work in Verilog (IBM) and OpenCL (Altera).
The main contributions of this paper are:
of Result compared to traditional RTL design [6]. Vivado HLS is not a push-
button C-to-FPGA synthesis tool and requires various manual transformations
to customise the hardware architecture and achieve well performing designs.
Additionally, Xilinx provides SDAccel, which is a programming environment for OpenCL, C, and C++. In addition to the compiler, it also provides a simulator and profiling tools. Xilinx claims to achieve up to 20% better results than with hand-coded RTL designs and 3× better performance and resource efficiency compared to OpenCL solutions by competitors. SDAccel also supports
partial runtime reconfiguration of FPGA regions without halting the remaining
accelerators running on the chip [7].
IBM’s Liquid Metal supports data flow and map-reduce. The Lime language
is Java based and supports CPUs as well as FPGAs and GPUs. The hardware
type is chosen at runtime based on available capacities in the datacenter [8].
Catapult C creates FPGA and ASIC designs from ANSI C++ and System-
C descriptions [9]. Similar to other high-level synthesis tools, it requires the
designer to perform iterations on the original C-code and manually tweak the
hardware architecture in order to achieve a fast implementation.
Chisel is a Scala based hardware description language. Unlike other
approaches focusing on synthesis from a C-like language, the concept behind
Chisel is to add modern programming language features to a hardware descrip-
tion language. Design is still low level but the goal is to improve productivity by
supporting high-level abstractions in the language [10].
The next section will explain the main advantages and differences of MaxJ.
Using Maxeler's MaxelerOS and the SLiC library, the simulation models or hardware configurations are tightly integrated into a CPU executable written in, for example, C, Fortran, MATLAB, or Python, to allow rapid development of FPGA-accelerated applications. The communication between FPGA and CPU is implemented using very high-level streaming primitives, and there is no need for the user to worry about any of the low-level details.
Maxeler’s data-flow systems are built using its proprietary PCIe data-flow
engines (DFEs). The MAX4 DFEs incorporate the largest Altera Stratix-V
FPGAs as a reconfigurable computing substrate. This device is connected to a
large capacity parallel DRAM (24-96 GB) to facilitate large in-memory datasets.
Additionally, DFEs for networking are available, which offer additional connectivity via a maximum of three 40 Gbit/s ports.
3 Gzip
Gzip is a utility [12] as well as a file format for lossless data compression [13]. For
data compression DEFLATE [14] is used, which is a combination of Lempel-Ziv
compression [15] and Huffman encoding [16].
The idea of the Lempel-Ziv compression algorithm is to replace multiple
occurrences of equivalent byte sequences with a reference to the first sequence.
This reference consists of a marker, showing that this data has to be interpreted
as an index, a match length, indicating how many bytes are equal, and an offset,
defining the distance to the first occurrence of the byte sequence.
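As a rough illustration of the information carried by such a reference (the field widths are illustrative; DEFLATE encodes them with variable-length codes rather than a fixed struct):

#include <stdint.h>

/* Illustrative representation of one element of the Lempel-Ziv token
 * stream; DEFLATE encodes these fields with variable-length Huffman
 * codes instead of a fixed-width struct. */
typedef struct {
    uint8_t  is_match;   /* marker: 0 = literal byte, 1 = back-reference */
    uint8_t  literal;    /* the literal byte, used when is_match == 0    */
    uint16_t length;     /* match length (3..258 bytes in DEFLATE)       */
    uint16_t offset;     /* distance back to the first occurrence        */
} lz_token;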
Huffman coding replaces all data in a symbol stream with code words. It is an entropy encoder, which means that frequently used symbols will require fewer bits. A Huffman code is a prefix code, which guarantees that no code word is a prefix of any other code word and, as a result, guarantees unambiguous encoding.
The gzip standard defines two different forms of Huffman codes. The simpler
one is the static Huffman code which is defined in the standard itself [14]. A
different option is to create a customized Huffman code based on the actual input
data. The Huffman code itself then needs to be encoded as well to enable the
decompressor to correctly decode the data. Therefore the compressed Huffman
code description is placed before the actual compressed data in the data-stream.
While often providing a better compression ratio, this method is more complex to implement and leads to extra calculations at runtime.
Since gzip is so widely used, there are many different implementations of it.
Intel published a high throughput CPU implementation achieving a throughput
of 0.34 GB/s [17]. There are also many high-throughput FPGA implementations
like the already mentioned implementations by Altera [4] and IBM [1] which
achieve throughputs between 2.8 and 3 GiB/s. A more recent FPGA based pub-
lication by Microsoft reports a throughput of 5.6 GB/s [18]. In addition, ASIC
implementations of gzip exist with throughputs of up to 10 GB/s [19].
ports they used a fixed number of hash tables with one read port each. The main idea is that the possible hash keys are equally distributed onto different hash tables. If m hash tables are created, the least significant log2(m) bits are used to determine which hash table is used for each hash value. In order to be able to save different data items for the same hash value, each hash table can be copied, so that in order to avoid hash conflicts a different copy of the hash table can be used. The hash tables run at double frequency compared to the remaining design, which effectively doubles the number of read and write ports.
The biggest problem with this implementation is that for a given set of least significant bits only two writes can be accomplished in one cycle. All other matches whose hash keys have the same least significant bits have to be dropped, slightly reducing the compression ratio. With these optimizations and a few other small changes, Microsoft was able to increase the throughput significantly with limited impact on the compression ratio.
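A minimal sketch of this bank-selection scheme, assuming m is a power of two and a 32-bit hash key (both assumptions made for illustration only):

#include <stdint.h>

#define NUM_TABLES 8u   /* m hash tables; m must be a power of two */

/* The least significant log2(m) bits of the hash key select the table
 * (bank); the remaining bits address the entry inside that table. */
static void select_bank(uint32_t hash_key, uint32_t *bank, uint32_t *index)
{
    *bank  = hash_key & (NUM_TABLES - 1u);
    *index = hash_key / NUM_TABLES;
}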
Since Microsoft did not report design time we can not directly compare
against their design process and will focus on those used by Altera and IBM.
The hash table lookup provides n² possible matches, since we perform n
lookups for each input byte. The first step is to perform the actual match search,
which requires a comparison of the input data with the already processed data
stored in the hash tables. The target is to find the longest match starting at each
position in the input window, to allow encoding with as few bits as possible. In
order to avoid complex inter-cycle dependencies the maximal match length is
limited to the number of bytes read per cycle.
Since one byte may be covered by multiple matches, only a selection of all
found matches has to be encoded. Decisions made here also impact the encoding
in the next cycle, since a match might also cover symbols of the next input
window. Since the design has to be fully pipelined, this inter cycle dependency
has to be resolved within one cycle to prevent pipeline stalls.
If a match only covers a few symbols it might be cheaper to encode this as
literals and not as a match. In this case the match will be ignored. A heuristic
is applied on the remaining matches to resolve the inter-cycle dependencies.
This heuristic takes the match for the last symbol in the input window as the maximal match length into the input window of the next cycle. Since the maximal match length is n, the last symbol is never covered by a match from a previous input window, and thereby we do not have to consider any other inter-cycle dependencies here. While this heuristic may decrease the compression ratio, it enables a fully pipelined design while limiting the design complexity.
In order to finally select the matches, first all matches for symbols that were already covered by a match from the previous cycle are removed. Then the reach of each match is calculated, which is defined as the sum of the position of the current symbol and the match length. If two matches have the same reach, they encode all symbols up to the same position, and the match which covers more symbols in total is selected. A more detailed explanation is available in [4].
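A simplified, sequential sketch of the reach idea is given below; the actual design resolves the selection combinationally within one cycle and applies the tie-breaking rule described above, which the sketch omits.

#include <stdint.h>

#define WINDOW 16u   /* bytes processed per cycle (16 in the IBM and Altera designs) */

/* match_len[i] holds the length of the best match starting at position i
 * of the input window (0 if none). Accept matches greedily by reach,
 * where reach = start position + match length. */
static void select_matches(const uint8_t match_len[WINDOW],
                           uint8_t selected[WINDOW])
{
    uint32_t covered_until = 0;   /* symbols already covered by an accepted match */

    for (uint32_t i = 0; i < WINDOW; i++) {
        selected[i] = 0;
        if (i < covered_until || match_len[i] == 0)
            continue;                      /* already covered, or no match here    */
        selected[i] = 1;                   /* accept this match                    */
        covered_until = i + match_len[i];  /* its symbols need no further encoding */
    }
}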
At last, the data has to be encoded using Huffman coding. This can be done
symbol-wise after the match selection. These code words then get combined using
shifters and OR-gates to form the final output bitstream.
Altera uses bit vectors instead so that for every similar byte a bit in the vector
is set as shown in Listing 1.1 and Fig. 3. The advantage is that OR operations
and shifters cost less than ADDs and MUXs. It also enables the scheduler to
use fewer FIFOs to implement this part of the algorithm, since all OR operations can be scheduled in the same clock cycle and there is no dependency between the different iterations of the unrolled loop. As a result, the OR operations can be scheduled in a tree-like fashion, which reduces the number of required FIFOs. By using this technique, a 5% reduction of logic resources is claimed.
// compare current/comparison windows
#pragma unroll
for (char k = 0; k < LEN; k++)
{
    if (curr_window[j + k] == comp_window[k][i][j])
        length_bool[i][j] |= 1 << k;
}
Writing the same code in MaxJ would already reduce resources, since the shifts
are omitted in hardware as the result of these operations would be computed at
build time instead. This, as stated in [4], is not done by the OpenCL SDK.
Listing 1.2 shows an equivalent MaxJ implementation with some additional
improvements. The # operator is used to concatenate bits. So in this case we
concatenate all results of the comparators bit-by-bit, which does not use any
additional resources. Also we do not need any registers or FIFOs because the
concatenation has no latency at all. The only costs come from the comparators.
The result of that is also shown in Fig. 4.
// compare current/comparison windows
lengthBool[i][j] = currWindow[j] === compWindow[0][i][j];
for (int k = 1; k < LEN; k++) {
    lengthBool[i][j] #= currWindow[j + k] === compWindow[k][i][j];
}
Other MaxJ language features make it easier to meet timing. For example, the
calculated hash keys are used at many different places and, as a result, have
quite a large fanout. Since a huge chunk of the available memory resources on
the FPGA are used for hash tables, the hash keys have to be routed to very
distant locations of the chip. In order to compensate for this and help meet timing, an additional register was added after the hash key calculation, as shown in Listing 1.3. The place and route tools can now duplicate this register, if needed, in order to distribute the signal to all hash tables, where it is used for addressing.
for (int i = 0; i < bytesPerCycle; i++) {
    hashKey[i] = optimization.pipeline(calculateHashKey(currWindow, i));
}
Listing 1.3. Adding a register to the hashKey signal, which is returned by the calculateHashKey() function. It is then passed into the optimization.pipeline() function to add the register.
On the FPGA platform used by Altera, the input data is transmitted over PCIe to DDR3 memory. The same principle applies to the encoded data, which is first written into DDR3 memory before it is sent back to the host via PCIe. In the MaxJ design the data does not need to be buffered in external memory but can be sent directly via PCIe to the FPGA, where it is processed.
Since on-chip memory capacity is the limiting factor of the gzip design, a different implementation of the Huffman encoding was used. Altera used a lookup table which can be changed by the CPU. In our design we calculate the Huffman code words on the fly and do not waste any on-chip memory.
This slightly limits the adaptability, since only one fixed Huffman tree is available. This tree is optimized for all possible match lengths but could also be optimized for known payloads. While no big impact on the compression ratio could be observed, this change is key in enabling our design to process 20 bytes per cycle. Both the IBM and Altera designs process only 16 bytes per cycle.
5 Performance Evaluation
We now compare the performance and area utilization of the different designs.
The area utilization is compared in Table 1. First, we are going to only compare
the 16 byte per cycle MaxJ design with the designs implemented by IBM and
Altera, since all these designs process the same number of bytes per cycle. The
MaxJ design uses significantly fewer resources than the OpenCL design. The area
utilization numbers for the IBM design shown here were estimated and reported
by Altera based on a chip image [4]. So while we can only work with estimates, we can still assume that the logic utilization of the MaxJ design is at least on par with that of the Verilog design. Only the RAM utilization is higher, which
is probably caused by the scheduling overhead of 443 pipeline stages in contrast
to the 17 stages of the Verilog design. Despite the fact that the OpenCL design
uses only 87 pipeline stages the MaxJ design uses fewer memory resources.
Throughput and compression ratio differences are depicted in Table 2. The
compression ratio for all designs was evaluated using the Calgary corpus [22] and
the geometric mean. While the compression ratio of the Intel, IBM and Altera
designs are almost identical, the MaxJ design shows a slight improvement. The
reason for this is probably a different hashing function (as described in [23])
which improves the compression ratio at the cost of additional logic resources.
While IBM reported a frequency of “just under 200 MHz” [1], Altera claims
a frequency of 193 MHz. Our MaxJ design for 16 Bytes successfully runs at
200 MHz without any optimizations aimed to help meeting timing.
When we use the available space to process 20 bytes per cycle instead of 16
and additionally perform timing optimizations, our design achieved a throughput
of 5 GB/s at 250 MHz. This makes our design nearly 15× faster than Intel’s
high throughput CPU implementation and nearly 1.8× faster than the OpenCL
implementation by Altera [1,4].
6 Productivity Discussion
In [4] Altera reported one month development time for their OpenCL gzip imple-
mentation. The MaxJ design presented here was implemented by one intern student within a single month. The intern was a novice to MaxJ and had only one week to work through the MaxJ tutorials. This clearly shows that learning MaxJ can
be quick with a software development background in high-level languages.
An advantage of HLS in contrast to classical hardware description languages is that the code is very readable and compact (the entire MaxJ gzip code is
only 959 lines). This makes it easier to focus on optimizations and to make big
changes in the architecture, since modern programming tools like unit tests can
be used in combination with the simulator to quickly validate functionality. For
example, the switch from the 16 byte per cycle design to 20 bytes was done by
only changing a single constant in the code.
Because the MaxJ tools create deeply pipelined structures, meeting timing is easier. While deep pipelining increases the overall memory usage, it enables the
designer to use more space of the chip productively.
As previously mentioned, Microsoft also reported an FPGA based gzip design
using a slightly modified design architecture [18] achieving 5.6 GB/s on a Stratix
V FPGA. We were able to also create a design using their architecture and again
achieve a higher throughput of 9.6 GB/s. Since we could reuse most of the already
written MaxJ code, the actual implementation time went down to roughly one
week. A few more weeks of part-time effort were needed in order to fine-tune parameters like the used hash function and hash table configuration as well as
improve timing. It has to be noted that while meeting timing is time consuming
it is not as costly as development time, since it mainly requires CPU time and
not engineering effort.
When comparing to OpenCL, we can see that in a similar time far better
results could be achieved with MaxJ. A reason for this is the more direct control
over the hardware provided by MaxJ. This allows designers with good under-
standing of the underlying hardware to benefit from those additional improve-
ments. For example, the option to directly insert registers in the design (as
shown in Sect. 4) allows easier timing closure. Another good example is the
direct impact that widths of the variables have on the hardware area utilization.
While it is possible to reuse existing OpenCL designs for CPUs and GPUs to
target FPGAs it has to be noted, that the performance of the ported designs will
be suboptimal in most cases. For example in [24] the same OpenCL code was
executed on CPUs and FPGAs. The CPU versions all outperform the FPGA ver-
sions even though efficient hardware implementations for the tested algorithms
exist. This shows that, similar to most other high-level synthesis frameworks
(see Sect. 2), it is necessary to employ a series of code transformations in order
to create efficient hardware designs. As a result a change of the programming
language as well as the associated toolchain introduces only a limited overhead.
The above suggests that developing in MaxJ is significantly faster than in
OpenCL since we had enough time to perform careful timing optimizations and
compression ratio improvements. As a result this enabled us to deliver a signifi-
cantly better bitstream in terms of throughput and compression ratio.
7 Conclusion
In this paper we presented a rapid FPGA implementation of gzip compression.
We demonstrated that using MaxJ for high-level synthesis enabled us to achieve
better results within the same amount of development time as compared to
OpenCL. Furthermore, we showed that MaxJ and its development tools enable
very competitive development times in comparison to classical hardware descrip-
tion approaches. Our design outperforms the OpenCL implementation by 1.8×
in terms of throughput and delivers 5% better compression ratio by using only
∼10% more resources. In addition, the presented design achieves a 1.7× higher
throughput as compared to the Verilog implementation by IBM.
References
1. Martin, A., Jamsek, D., Agarwal, K.: FPGA-based application acceleration: case
study with GZIP compression/decompression stream engine. In: International Con-
ference on Computer-Aided Design (ICCAD), November 2013
2. Altera: OpenCL for Altera FPGAs: Accelerating Performance and Design Pro-
ductivity (2012). http://www.altera.com/products/software/opencl/opencl-index.
html
3. OpenSPL (2015). http://www.openspl.org
4. Abdelfattah, M.S., Hagiescu, A., Singh, D.: Gzip on a chip: high performance
lossless data compression on FPGAs using OpenCL. In: International Workshop
on OpenCL ACM, pp. 1–9 (2014)
5. Rashid, R., Steffan, J.G., Betz, V.: Comparing performance, productivity and scal-
ability of the TILT overlay processor to OpenCL HLS. In: Field-Programmable
Technology (FPT). IEEE, pp. 20–27 (2014)
6. Vivado HLS. http://www.xilinx.com/support/documentation/sw manuals/ug1197-
vivado-high-level-productivity.pdf. Accessed 18 Nov 2015
7. Xilinx: The Xilinx SDAccel Development Environment (2014). http://www.xilinx.
com/publications/prod mktg/sdx/sdaccel-backgrounder.pdf
8. Liquid Metal (2015). www.research.ibm.com/liquidmetal/
9. Catapult C (2015). http://calypto.com/en/products/catapult/overview/
10. Bachrach, J., et al.: Chisel: constructing hardware in a Scala embedded language.
In: Design Automation Conference (DAC). ACM, pp. 1216–1225 (2012)
11. Stone, J.E., Gohara, D., Shi, G.: OpenCL: a parallel programming standard for
heterogeneous computing systems. Comput. Sci. Eng. 12(3), 66–73 (2010)
12. Gzip (2015). http://www.gzip.org
13. Deutsch, P.: Gzip file format specification version 4.3 (1996). http://tools.ietf.org/
html/rfc1952
14. Deutsch, P.: RFC 1951 deflate compressed data format specification version 1.3
(1996). http://tools.ietf.org/html/rfc1951
15. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE
Trans. Inf. Theory 23(3), 337–343 (1977)
16. Huffman, D.A.: A method for the construction of minimum-redundancy codes. In:
Proceedings of IRE, vol. 40, no. 9, pp. 1098–1101 (1952)
17. Gopal, V., Guilford, J., Feghali, W., Ozturk, E., Wolrich, G.: High Perfor-
mance DEFLATE Compression on Intel Architecture Processors (2011). http://
www.intel.com/content/dam/www/public/us/en/documents/white-papers/
ia-deflate-compression-paper.pdf
18. Fowers, J., Kim, J.-Y., Burger, D., Hauck, S.: A scalable high-bandwidth architec-
ture for lossless compression on FPGAs. In: 23rd IEEE International Symposium
on Field-Programmable Custom Computing Machines, pp. 52–59 (2015)
19. AHA 378 (2015). http://www.aha.com/data-compression/
20. Huang, W.-J., Saxena, N., McCluskey, E.J.: A reliable LZ data compressor on
reconfigurable coprocessors. In: Symposium on Field-Programmable Custom Com-
puting Machines. IEEE, pp. 249–258 (2000)
21. Hwang, S.A., Wu, C.-W.: Unified VLSI systolic array design for LZ data compres-
sion. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 9(4), 489–499 (2001)
22. Calgary Corpus (2015). http://corpus.canterbury.ac.nz/descriptions/#calgary
23. Sadakane, K., Imai, H.: Improving the speed of LZ77 compression by hashing and
suffix sorting. IEICE Trans. Fundam. Electr. Commun. Comput. Sci. E83–A(12),
2689–2698 (2000)
24. Ndu, G., Navaridas, J., Lujan, M.: Towards a benchmark suite for OpenCL FPGA
accelerators. In: Proceedings of 3rd International Workshop on OpenCL (IWOCL
2015), NY, USA, Article 10
On the Use of (Non-)Cryptographic Hashes
on FPGAs
1 Introduction
Hash functions are used to calculate a fixed-size hash value from a given input
of arbitrary length. They have numerous applications, e.g., hash tables, integrity
protection, Bloom filters, or authentication, making them a vital component in
almost any computer system. These applications are built on top of standard
CPU-based systems as well as dedicated hardware like FPGAs. Nevertheless,
the requirements for a fast and efficient algorithm differ substantially between
software for CPUs and hardware description for FPGAs. The advantage of a
hardware implementation lies in the potential for massive parallelization at a
comparatively low clock rate. In practice, many fast hash functions used for
hash tables were designed originally for software and do not perform well when
implemented in hardware [1].
This problem becomes even more relevant if the hash application requires
multiple, independent hash values of the same key. Examples for these appli-
cations are hash tables with double hashing [2], cuckoo hashing [3], or Bloom
filters [4]. The developer has to choose between re-using one hash module, which
increases the latency for the calculation, or implementing multiple hash modules
at the cost of higher resource usage. As both methods bear significant drawbacks,
the question arises whether it is possible to exploit the fact that the required
hash size is often significantly smaller than the actual size of the hash function’s
output. If there are no weaknesses in the output of a given hash function, the
hash could be split into multiple sub-hashes. Although some authors argue that
small flaws in the hash calculation are acceptable for hash tables when the full
hash value is used [5], it is not clear if this is still the case when only parts of
the hash are used.
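A minimal sketch of such splitting, assuming a 128-bit hash value and 16-bit sub-hashes (the sizes match the use case evaluated later, but the code itself is only illustrative):

#include <stdint.h>

#define SUB_HASH_BITS 16u
#define SUB_HASHES    8u          /* 8 x 16 bit = one 128-bit hash value */

/* Split a 128-bit hash value, given as two 64-bit words, into eight
 * non-overlapping 16-bit sub-hashes. */
static void split_hash(uint64_t hash_hi, uint64_t hash_lo,
                       uint16_t sub[SUB_HASHES])
{
    for (uint32_t i = 0; i < 4; i++) {
        sub[i]     = (uint16_t)(hash_lo >> (SUB_HASH_BITS * i));
        sub[i + 4] = (uint16_t)(hash_hi >> (SUB_HASH_BITS * i));
    }
}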
Many non-cryptographic hash functions reveal issues when their avalanche
effect is analyzed [6,7], in the sense that some input bits do not optimally prop-
agate through the function. One of our goals is to determine the implications of
those weaknesses with regard to our desired sub-hashes. It should be noted that
a good avalanche effect of the function still does not necessarily imply there are
no weaknesses in the hash, as can be seen for, e.g., the MurmurHash [8].
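For reference, a simple avalanche test flips one input bit at a time and counts how many output bits change; ideally each output bit flips with probability 0.5. The sketch below uses a 64-bit FNV-1a hash purely as a self-contained stand-in for the functions analyzed here.

#include <stdint.h>
#include <string.h>

/* 64-bit FNV-1a, used here only as a stand-in hash for the sketch. */
static uint64_t hash64(const uint8_t *key, size_t len)
{
    uint64_t h = 14695981039346656037ull;      /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= key[i];
        h *= 1099511628211ull;                  /* FNV prime */
    }
    return h;
}

/* Count, for every input bit i, how many output bits j flip when input
 * bit i is toggled; flips[i][j] accumulates over many keys.
 * Assumes len <= 256 bytes. */
void avalanche_count(const uint8_t *key, size_t len, uint32_t flips[][64])
{
    uint8_t tmp[256];
    uint64_t ref = hash64(key, len);

    memcpy(tmp, key, len);
    for (size_t i = 0; i < len * 8; i++) {
        tmp[i / 8] ^= (uint8_t)(1u << (i % 8));   /* flip input bit i    */
        uint64_t diff = ref ^ hash64(tmp, len);   /* changed output bits */
        for (unsigned j = 0; j < 64; j++)
            flips[i][j] += (uint32_t)((diff >> j) & 1u);
        tmp[i / 8] ^= (uint8_t)(1u << (i % 8));   /* restore input bit i */
    }
}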
As previously mentioned, one must be aware that fast and efficient hash func-
tion designs for CPUs and hardware differ. Regarding the hardware implementa-
tion, the most important metrics of a hashing algorithm are resource utilization,
latency l in clock cycles, and execution time as a result of the maximum possi-
ble clock rate. Typical hardware hash implementations are not fully pipelined,
meaning they are blocked until one calculation finishes. A significant fraction of
common hash functions suffer from large latencies when implemented in hard-
ware [1], making them less suitable for, e.g., high-speed network applications.
Furthermore, to gain full advantage of a highly parallelized processing pipeline,
it is often necessary to process one key per clock cycle. In this case, the hash function must be implemented l times in hardware in order to be used in a round-robin manner.
The growing importance of dedicated, feature-rich hardware components led
to a shift in requirements when new standard algorithms are defined. For exam-
ple, the winning candidate for the Secure Hash Algorithm 3 (SHA3) [9] was
required to perform well in hardware. This opens the question whether such a
hardware-optimized cryptographic hash function is more suitable than the non-
cryptographic alternatives mentioned above.
The main contributions of this work are threefold: first, we show how statistically relevant weaknesses in the avalanche effect of a hash function can affect
the uniformity of sub-hashes. Second, we examine the characteristics of sev-
eral hash functions when implemented for a multi-hash FPGA use case. Third,
based on these results, we demonstrate that SHA3 is currently a better choice
for many FPGA use cases regarding these characteristics in comparison to non-
cryptographic hashes.
2 Related Work
The key-value lookup accomplished by hash tables is important for a variety of
networking tasks like stateful packet filtering, route lookup, or intrusion detec-
tion. Since dedicated hardware is increasingly used for these types of applica-
tions, hash table implementations for FPGAs have been widely discussed [5,10].
Bloom filters [4] can be a fast and efficient alternative when the only task is to
query if a key is present in the filter. Since this is the case for many classification
tasks in networking systems, Bloom filters experience wide application in this
field [11]. Their feasibility on FPGAs has been shown in [10,12]. With memory
lookups being the critical factor, Song et al. suggested using Bloom filters to
reduce the amount of hash table operations by first probing a Bloom filter if the
lookup is required in the first place [10]. If the query is negative, no expensive
hash table lookup is necessary.
Good hash functions are also of major and ongoing interest [9]. Countless
hash functions—cryptographically secure or not—have been introduced, quite
a few of which have been shown to have significant flaws with regard to the
expected qualities of a good hash function [8,13]. Hardware implementations of
several hash functions were analysed in [1], with the result that most of them per-
form badly, causing a latency too high for network processing applications [10].
When hash functions are used for hash tables, the main issue derives from
attackers being able to generate hash collisions with different keys. This can
degrade the performance of hash tables and allow for denial-of-service (DoS)
attacks [14]. Such flaws also led to security advisories, e.g., [8,15]. Bar-Yosef
et al. were able to successfully attack the hash table in Linux’ netfilter fire-
wall [16], even though a randomization technique was in place to protect against such attacks.
From the variety of applications for hash functions, we focus on FPGA use cases
requiring multiple, independent hash values for, e.g., hash tables using open
addressing by double hashing [2], cuckoo hashing [3], or Bloom filters [4].
There are different ways of generating such independent hash values out of
the same key: (1) using different hashing algorithms, (2) using the method of
double hashing, where two different hash functions are employed to compute the
hashes, (3) mixing distinct seeds to the key before feeding it to the same hash
function, and (4) splitting one hash value into non-overlapping sub-hashes.
The drawback of the first two options is that they lead to either higher resource
usage or increased latency, since multiple hash functions need to be computed. For the
third option, one can choose between implementing multiple hash modules at the cost of
logic resources, or re-using one implementation and thereby increasing the latency
until all results are calculated. The last option is the only one that saves both space
and latency, but it requires a hash function of sufficient quality (options (3) and (4)
are sketched below).
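For illustration, the following Python sketch (not taken from the paper; the function names and parameters are ours) derives several independent hash values from one key, either by splitting a single SHA3-512 digest into non-overlapping sub-hashes (option 4) or by mixing distinct seeds into the key before hashing (option 3):

import hashlib

def sub_hashes(key: bytes, count: int, bits: int):
    # Option (4): split one SHA3-512 digest into `count` non-overlapping
    # sub-hashes of `bits` bits each (requires count * bits <= 512).
    digest = int.from_bytes(hashlib.sha3_512(key).digest(), "big")
    mask = (1 << bits) - 1
    return [(digest >> (i * bits)) & mask for i in range(count)]

def seeded_hashes(key: bytes, seeds, bits: int):
    # Option (3): prepend a distinct seed byte to the key and re-hash;
    # this costs one full hash computation per required value.
    mask = (1 << bits) - 1
    return [int.from_bytes(hashlib.sha3_512(bytes([s]) + key).digest(), "big") & mask
            for s in seeds]

# e.g. eight independent 16-bit indices for a Bloom filter or a cuckoo hash table
print(sub_hashes(b"example-key", 8, 16))
print(seeded_hashes(b"example-key", range(8), 16))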
4 Hash Algorithms
5 Evaluation
We begin our evaluation with an analysis of an avalanche-weak hash function
and show how the described method of splitting a hash value into sub-hashes is
affected. Afterwards, different examples of hashing algorithms are implemented
for an FPGA and the results are evaluated.
For the hardware evaluation, we selected (1) the Jenkins hash, (2) SpookyHash, the successor
function of Bob Jenkins, (3) SipHash [20], the proposed alternative for hashes
like MurmurHash and CityHash, and (4) SHA3 [21], as a current state-of-the-art
cryptographic hash function. While the avalanche-weak Jenkins was included
in our evaluation for reference, CityHash and MurmurHash were not further
considered due to reported weaknesses [8,19].
All implementation results were determined for a fixed-size input key of
288 bit, corresponding to the 4-tuple of two IPv6 addresses and two port numbers.
The targeted frequency for all implementations was 200 MHz. The results
were determined using Xilinx Vivado 2014.4 with a Virtex 7 690t, speed grade
−2, as the targeted FPGA. Both Jenkins and SpookyHash were implemented natively
by ourselves, based on the available source code [17]. For SipHash, we used
the referenced Verilog implementation [22]; for SHA3, a SHA3-512 core from [23].
The latency was identified by simulating the HDL implementation of each core.
Hash        Size [bit]  LUTs   FFs    Lat. [CC/ns]  Use case LUTs       Use case FFs
Jenkins     64          2,874  3,419  76/380        436,848 (101.0%)    519,688 (56.0%)
SpookyHash  128         3,220  4,161  27/135        86,940 (20.1%)      112,347 (13.0%)
SipHash     32          944    789    52/260        196,352 (45.3%)     164,112 (19.0%)
SHA3-512    512         6,005  2,212  20/100        120,100 (27.7%)     44,240 (5.1%)
As can be seen in Table 1 for the evaluated hash functions and their hash
value size, there are significant differences in terms of the usage of lookup tables
(LUTs), flip-flops (FFs), and the inherent latency (CC, in clock cycles and ns at
200 MHz) for the calculation. To improve comparability, we included an example
calculation for a use case requiring eight independent 16-bit hash values of the
same key, with the capability of processing one key per clock cycle. This means
the hash core has to be replicated n = ⌈desired hash size / hash size⌉ × latency times.
Note that for SHA3, a smaller variant (e.g., SHA3-224) could be used since the hash
size is larger than the required use case output size. Since this use case assumes
that splitting the hash value bears no implications, the result for Jenkins is only
given as a reference. The percentages illustrate clearly that a significant amount
of the Virtex 7 FPGA resources are occupied for this use case. Moreover, a high
latency alone can be a criterion for exclusion depending on the application. For
comparison: in [10], the MD5 hash was deemed unsuitable for packet processing
applications due to the latency of 64 clock cycles based on speed requirements
present in the year 2005. Hence, from our evaluated candidates only SpookyHash
and SHA3-512 can be considered suitable for high-speed FPGA applications.
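The replication rule above can be checked against the use-case columns of Table 1; the following Python sketch (a worked example under the stated assumptions, with a ceiling to cover hash values wider than the required 128 bit) reproduces those numbers from the per-core figures:

import math

def replication_count(hash_bits, latency_cc, required_bits=128):
    # n = ceil(desired hash size / hash size) x latency: enough parallel cores
    # to cover the required output width and to accept one key per clock cycle.
    return math.ceil(required_bits / hash_bits) * latency_cc

cores = {  # name: (hash size in bit, LUTs, FFs, latency in clock cycles)
    "Jenkins":    (64, 2874, 3419, 76),
    "SpookyHash": (128, 3220, 4161, 27),
    "SipHash":    (32,  944,  789, 52),
    "SHA3-512":   (512, 6005, 2212, 20),
}
for name, (bits, luts, ffs, lat) in cores.items():
    n = replication_count(bits, lat)
    print(f"{name:10s} n={n:3d}  use-case LUTs={n * luts:7d}  use-case FFs={n * ffs:7d}")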
Fig. 2. FPGA utilization and latency for different hash value sizes.
As can be seen for the use case, the implementation results are dependent on
the desired size and amount of independent hash values, as well as resource usage
and latency of the hash functions. Two additional plots in Fig. 2 visualize this
for different hash value sizes, provided that all can be split into independent sub-
hashes of arbitrary size. Figure 2a assumes the total size is achieved by multiple,
parallel hash modules. In contrast, Fig. 2b shows how the latency is affected if
the necessary calculations are executed in series on one single hash module, thus
maintaining almost constant resource usage. Combined with the demonstrated
possible implications of non-cryptographic hash functions, we argue that the
cryptographic hash SHA3 should nowadays be the default choice for hardware
implementations, even for non-cryptographic applications.
6 Conclusion
Good hash functions are essential for a variety of applications. Since the usage
of FPGAs and dedicated hardware is increasing, the interest in fast hash-based
data structures like hash tables and Bloom filters will likely continue to rise.
References
1. Shi, Z., Ma, C., Cote, J., Wang, B.: Hardware implementation of hash functions.
In: Tehranipoor, M., Wang, C. (eds.) Introduction to Hardware Security and Trust,
pp. 27–50. Springer, Heidelberg (2012)
2. Bookstein, A.: Double hashing. J. Am. Soc. Inf. Sci. 23(6), 402 (1972)
3. Pagh, R., Rodler, F.F.: Cuckoo hashing. In: Heide, F.M. (ed.) ESA 2001. LNCS,
vol. 2161, pp. 121–133. Springer, Heidelberg (2001). doi:10.1007/3-540-44676-1 10
4. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun.
ACM 13(7), 422–426 (1970)
5. Broder, A., Mitzenmacher, M.: Using multiple hash functions to improve IP
lookups. In: Proceedings of INFOCOM 2001, Twentieth Annual Joint Conference
of the IEEE Computer and Communications Societies, vol. 3. IEEE (2001)
6. Feistel, H.: Cryptography and computer privacy. Sci. Am. 228(5), 15–23 (1973)
7. Neustar Inc, “Choosing a Good Hash Function, Part 3,” February 2012.
https://research.neustar.biz/2012/02/02/choosing-a-good-hash-function-part-3/.
Accessed 15 November 2016
8. oCERT.org, “#2012-001 multiple implementations denial-of-service via
MurmurHash algorithm collision” (2012). http://www.ocert.org/advisories/
ocert-2012-001.html. Accessed 14 November 2016
9. “Federal Register, vol. 72, no. 212”. http://csrc.nist.gov/groups/ST/hash/
documents/FR Notice Nov07.pdf. Accessed 14 November 2016
10. Song, H., Dharmapurikar, S., Turner, J., Lockwood, J.: Fast hash table lookup
using extended bloom filter: an aid to network processing. ACM SIGCOMM Com-
put. Commun. Rev. 35(4), 181–192 (2005)
11. Broder, A., Mitzenmacher, M.: Network applications of bloom filters: a survey.
Internet Math. 1(4), 485–509 (2004)
12. Attig, M., Dharmapurikar, S., Lockwood, J.: Implementation results of bloom fil-
ters for string matching. In: 12th Annual IEEE Symposium on Field-Programmable
Custom Computing Machines, FCCM 2004, pp. 322–323. IEEE (2004)
13. Klima, V.: Tunnels in hash functions: MD5 collisions within a minute. IACR Cryp-
tol. ePrint Arch. 2006, 105 (2006)
14. Crosby, S., Wallach, D.: Denial of service via algorithmic complexity attacks. In:
Usenix Security, vol. 2 (2003)
1 Introduction
The Fast Fourier Transform (FFT) is a widely used transform algorithm in signal
processing: it is primarily a computational tool to efficiently calculate the
Discrete Fourier Transform (DFT) and its inverse on digital computers.
Since its introduction by Cooley and Tukey [1], the FFT has been the mainstay for spectral
analysis of digital signals. Spectral analysis is extensively used in communication
systems, signal processing, image processing, bio-robotics, intelligent maintenance, and
almost every branch of science and engineering [2–4], making the FFT one of the most
widely used algorithms on digital devices. With the advent of smart phones and hand-held
media and entertainment devices, the performance and cost of FFT processors have
ever greater significance. The computational speed, accuracy, and chip area utilization
of the FFT have a direct bearing on the cost and performance of modern digital
devices.
devices. Moreover, very high data rate applications such as real-time intelligent
The FFT is an efficient algorithm for computing the DFT, which maps an input
sequence x(n) into its equivalent frequency domain representation X(k). The N-point
DFT of x(n) is defined as follows:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \qquad 0 \le k \le N-1 \qquad (1)$$

where $W_N^{kn}$ is often referred to as the twiddle factor, given by the relation in Eq. (2):

$$W_N^{kn} = e^{-j 2\pi kn / N} = \cos\!\left(\frac{2\pi kn}{N}\right) - j \sin\!\left(\frac{2\pi kn}{N}\right) \qquad (2)$$

The FFT effectively uses the symmetry and periodicity of the complex twiddle
factors to compute the DFT. The Radix-2 DIF algorithm decomposes the N-point
output X(k) as given in Eq. (1) into even-numbered samples X(2k) and odd-numbered
samples X(2k+1), as given in Eqs. (3) and (4) respectively:

$$X(2k) = \sum_{n=0}^{N/2-1} \left[ x(n) + x\!\left(n + \tfrac{N}{2}\right) \right] W_N^{2kn} \qquad (3)$$

$$X(2k+1) = \sum_{n=0}^{N/2-1} \left[ x(n) - x\!\left(n + \tfrac{N}{2}\right) \right] W_N^{2kn}\, W_N^{n} \qquad (4)$$
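As a software reference for Eqs. (3) and (4) (illustrative only, not the hardware architecture), the following Python sketch applies the radix-2 DIF decomposition recursively:

import cmath

def dif_fft(x):
    # Recursive radix-2 DIF FFT following Eqs. (3) and (4); len(x) must be a power of two.
    N = len(x)
    if N == 1:
        return list(x)
    half = N // 2
    w = [cmath.exp(-2j * cmath.pi * n / N) for n in range(half)]   # W_N^n
    even_in = [x[n] + x[n + half] for n in range(half)]            # feeds X(2k), Eq. (3)
    odd_in  = [(x[n] - x[n + half]) * w[n] for n in range(half)]   # feeds X(2k+1), Eq. (4)
    X = [0] * N
    X[0::2] = dif_fft(even_in)
    X[1::2] = dif_fft(odd_in)
    return X

print(dif_fft([1, 2, 3, 4, 0, 0, 0, 0]))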
FFT algorithms are often implemented using either memory-based or pipelined
architectures [14]. Memory-based FFT architectures consume fewer resources at the
cost of lower speed, whereas pipelined architectures achieve higher speed at the cost of
more resources. In this study, the SDF-based pipelined architecture is chosen for
hardware implementation on the FPGA as it requires fewer hardware resources, has
rather simple control logic, and is adaptable to various FFT algorithms.
Fig. 2. Single-path delay feedback (SDF) architecture for Radix-2 DIF FFT
Fig. 3. The specified R2SDF pipelined architecture for the 16-point FFT in hardware
The pipelined design for 16-point FFT is shown in Fig. 3, whereas a R2DIF
butterfly operation based on the SDF architecture (R2DIFSDF), is shown in Fig. 4,
with a controller used to create the appropriate control signals for FFT computation. As
shown in Fig. 3, the R2DIFSDF pipelined architecture has four stages and each stage
includes two adders, a multiplier, shift registers for holding intermediate data and
multiplexers to select data for the butterfly operation. The size of the shift registers equals
N/2 in the first stage and halves in the subsequent stages.
The operation of the R2DIFSDF takes place in three phases. In the 1st phase, the
multiplexer allows N/2 data points from the input to fill up the shift registers in N/2
clock cycles. During the 2nd phase, the butterfly computes the FFT between the incoming
N/2 data points and those already stored in the shift registers. The adder's output, i.e.
x(n) + x(n + N/2), is directly forwarded to the next stage without any multiplication,
whereas the subtractor's output, i.e. x(n) − x(n + N/2), is fed back into the shift
registers for temporary storage. This is done for all data points, i.e. 0 ≤ n < N/2. In
the 3rd phase, the buffered data is moved from the shift registers to the multiplier for
complex twiddle factor multiplication and the product is directly forwarded to the next
stage to complete the butterfly operation. The complex twiddle factors for the FFT rotation
are stored in a ROM. The N-point pipelined FFT architecture has the same SDF module
repeated in its log2 N stages. In general, a R2DIFSDF pipelined architecture for an N-point
FFT contains about log2 N − 1 multipliers, 2 log2 N adders and N − 1 shift registers.
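The following small Python helper (illustrative only) evaluates these approximate resource counts for the 16-point and 1024-point cases:

import math

def r2sdf_resources(n_points):
    # log2(N) stages, log2(N)-1 multipliers, 2*log2(N) adders, N-1 shift-register words
    stages = int(math.log2(n_points))
    return {"stages": stages,
            "multipliers": stages - 1,
            "adders": 2 * stages,
            "shift_register_words": n_points - 1}

print(r2sdf_resources(16))     # 4 stages, buffers of 8 + 4 + 2 + 1 = 15 words
print(r2sdf_resources(1024))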
Fig. 5. Example of deployment for the constant multiplication based on the shift-add method
speed. If the size and radix of the FFT are fixed, then the values of the twiddle factors are
constants determined using Eq. (2). Hence, the FFT rotation turns into multiplication
by constants. To optimize resources and reduce hardware complexity, the twiddle factor
multipliers are implemented with the shift-add method. The number of shift and add
operations depends on the number of non-zero bits in the constant; an example for
X × 55 is shown in Fig. 5. Using this method, only configurable logic blocks are
required, as opposed to dedicated multipliers. This way, the proposed design uses no
dedicated functional blocks of the FPGA at all.
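As a hedged illustration of the shift-add method (the exact decomposition of Fig. 5 is not reproduced here), a multiplication by 55 = 0b110111 needs one shifted copy of the operand per non-zero bit of the constant:

def times_55(x):
    # 55 = 32 + 16 + 4 + 2 + 1: five shifted copies, added together
    return (x << 5) + (x << 4) + (x << 2) + (x << 1) + x

assert all(times_55(x) == 55 * x for x in range(-1024, 1024))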
4 Experimental Results
The proposed R2DIFSDF pipelined 1024-point FFT architecture, and the aforemen-
tioned architectures for comparison are implemented on a Xilinx Virtex 7 XC7VX485T
FPGA using the Vivado Design Suite tool for functional and timing simulation and
synthesis. A comparison of the two implementations for 1024-point FFT in terms of
hardware complexity and performance is provided in Table 1.
Table 1. A comparison of the achieved hardware results for the two different designs.

                                         (A)       (B)
# of slice registers (CLB flip-flops)    2,046     1,591
# of slice LUTs                          3,159     16,275
# of IOBs                                92        92
# of block RAM/FIFO                      4         0
# of block DSPs                          30        0
# of clocking BUFGCTRLs                  1         1
Total # of clock cycles for execution    2,066     2,058
Execution time (µs)                      10.332    10.290
The best way to qualify a proposed design is to consider not only performance but
also area. It is necessary to have a clear measure for comparing the area and performance of
all designs. Hence, the area is measured in slices only, which are the main component of
all FPGAs. In the Xilinx Virtex-7 FPGA family, each DSP block has a 25 × 18
multiplier and an accumulator and occupies an area of around 500 slices, whereas each
BRAM block, used for storing data, is equivalent to the area of about 1,700 slices.
Table 2 shows the experimental results for the two implementations of the 1024-point
FFT and compares their efficiency in terms of area and execution time. The proposed
design uses about 20% fewer slices and no dedicated functional blocks at all.
(A): The traditional R2DIFSDF design based on memory using dedicated Xilinx
logic core blocks (BRAMs, DSPs)
(B): The proposed R2DIFSDF design with no dedicated functional blocks in the
architecture
The precision of the proposed design is measured by calculating the average relative
percentage error against baseline results obtained using Matlab. The results are
represented in a 16-bit fixed-point format with 10-bit precision. The average relative
percentage error of the proposed architecture is very low, about 0.52%. The deviation
between results obtained using the proposed design and a 64-bit PC is not significant,
especially in view of the gains made in hardware. The experimental results show the
high precision of the proposed FFT implementation on FPGA hardware, which is better
than Derafshi et al. [7] (1%) and Kumar et al. [15] (3.22%).
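A minimal sketch of how such an error figure can be computed against a double-precision baseline is shown below (Python/NumPy standing in for Matlab; the quantization model and the interpretation of the 10-bit precision as fractional bits are assumptions, not taken from the paper):

import numpy as np

def avg_rel_err_percent(measured, reference, eps=1e-12):
    # average relative percentage error of the hardware output vs. the baseline
    measured = np.asarray(measured, dtype=complex)
    reference = np.asarray(reference, dtype=complex)
    return 100.0 * np.mean(np.abs(measured - reference) / (np.abs(reference) + eps))

x = np.random.uniform(-1, 1, 1024)
xq = np.round(x * 1024) / 1024                 # crude model: 10 fractional bits
print(avg_rel_err_percent(np.fft.fft(xq), np.fft.fft(x)))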
Table 3. Comparison of the achieved results between the proposed design and previous designs.

                                        Derafshi [7]   Harikrishna [3]   Xilinx [16]   Kumar [15]   Proposed design
# of points FFT                         1,024          1,024             1,024         1,024        1,024
Operation clock frequency (MHz)         100            92.36             395           385          200
# of slice registers                    2,472          3,155             2,264         2,633        1,591
# of slice LUTs                         10,353         5,916             1,987         1,883        16,275
# of block RAM/FIFO                     32             –                 10            8            0
# of block DSPs                         10             16                12            17           0
Total # of slices                       43,225         22,617            27,251        26,616       17,866
Total # of clock cycles for execution   2,600          6,085             9,430         6,320        2,058
Execution time (µs)                     26             65.89             23.87         16.376       10.290
5 Conclusion
Acknowledgments. This work was supported by the Korea Institute of Energy Technology
Evaluation and Planning (KETEP) and the Ministry of Trade, Industry & Energy (MOTIE) of the
Republic of Korea (No. 20161120100350, No. 20162220100050), in part by the Leading
Human Resource Training Program of Regional Neo Industry through the National Research
Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning
(NRF-2016H1D5A1910564), in part by the Business for Cooperative R&D between Industry,
Academy, and Research Institute funded by the Korea Small and Medium Business Administration in
2016 (Grants No. C0395147 and S2381631), and in part by the development of a basic fusion
technology in the electric power industry (Ministry of Trade, Industry & Energy, 201301010170D).
References
1. Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex Fourier
series. Math. Comput. 19, 297–301 (1965)
2. Sanchez, M.A., Garrido, M., Lopez-Vallejo, M., Grajal, J.: Implementing FFT-based digital
channelized receivers on FPGA platforms. IEEE Trans. Aerosp. Electron. Syst. 44, 1567–
1585 (2008)
3. Harikrishna, K., Rao, T.R., Labay, V.A.: FPGA implementation of FFT algorithm for IEEE
802.16 e (mobile WiMAX). Int. J. Comput. Theory Eng. 3, 197–203 (2011)
4. Chu, W., Champagne, B.: A noise-robust FFT-based auditory spectrum with application in
audio classification. IEEE Tran. Audio Speech Lang. Process. 16, 137–150 (2008)
5. Pitkänen, T.O., Takala, J.: Low-power application-specific processor for FFT computations.
J. Sig. Process. Syst. 63, 165–176 (2011)
6. Wang, Y., Tang, Y., Jiang, Y., Chung, J.-G., Song, S.-S., Lim, M.-S.: Novel memory
reference reduction methods for FFT implementations on DSP processors. IEEE Trans. Sig.
Process. 55, 2338–2349 (2007)
7. Derafshi, Z.H., Frounchi, J., Taghipour, H.: A high speed FPGA implementation of a
1024-point complex FFT processor. In: 2010 Second International Conference on Computer
and Network Technology (ICCNT), pp. 312–315. IEEE (2010)
8. Iglesias, V., Grajal, J., Sanchez, M.A., López-Vallejo, M.: Implementation of a real-time
spectrum analyzer on FPGA platforms. IEEE Trans. Instrum. Meas. 64, 338–355 (2015)
9. Zhou, B., Peng, Y., Hwang, D.: Pipeline FFT architectures optimized for FPGAs. Int.
J. Reconfigurable Comput. 2009, 1–9 (2009)
10. Garrido, M., Parhi, K.K., Grajal, J.: A pipelined FFT architecture for real-valued signals.
IEEE Trans. Circ. Syst. I: Regul. Pap. 56, 2634–2643 (2009)
11. Wang, Z., Liu, X., He, B., Yu, F.: A combined SDC-SDF architecture for normal I/O
pipelined Radix-2 FFT. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 23, 973–977
(2015)
12. Ma, Z.-G., Yin, X.-B., Yu, F.: A novel memory-based FFT architecture for real-valued
signals based on a Radix-2 decimation-in-frequency algorithm. IEEE Trans. Circ. Syst. II:
Exp. Briefs 62, 876–880 (2015)
13. Luo, H.-F., Liu, Y.-J., Shieh, M.-D.: Efficient memory-addressing algorithms for FFT
processor design. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 23, 2162–2172 (2015)
14. Joshi, S.M.: FFT architectures: a review. Int. J. Comput. Appl. 116, 1–5 (2015)
15. Kumar, M., Selvakumar, A., Sobha, P.: Area and frequency optimized 1024 point Radix-2
FFT processor on FPGA. In: 2015 International Conference on VLSI Systems, Architecture,
Technology and Applications (VLSI-SATA), pp. 1–6. IEEE (2015)
16. Xilinx, Inc.: Logic core IP Fast Fourier Transform v8.0, Product specifications DS808
(2012)
Simulation and Synthesis
Soft Timing Closure for Soft Programmable
Logic Cores: The ARGen Approach
1 Introduction
As integrated circuits become increasingly complex and expensive to develop,
the ability to apply post-fabrication changes appears all the more attractive.
A direct gain lies in eliminating the cost and time associated with re-spinning
silicon when fixing a bug or specializing the device to a specific application.
Embedding reconfigurable logic in designs offers a solution to the semiconductor
designers who need to update silicon post production.
In this context, several embedded FPGAs (eFPGA) have been developed as
reported in [1–3]. eFPGAs are flexible logic fabrics that, once programmed,
implement digital circuits. But unlike FPGAs, eFPGAs are intended to serve
as pieces of a whole system-on-chip design. This approach allows:
This obviously comes at the cost of area and performance overheads, compared
to a straight silicon implementation. However, there are even more serious
limitations [5].
First, every eFPGA embeds a fixed amount of reconfigurable resources. Any
mismatch between these resources and the application's needs (in terms of
amount and nature of resources) is a serious issue. It may either prevent the
eFPGA from being used at all (if the application requirements exceed the eFPGA resources)
or lead to poor resource usage (internal fragmentation may nullify the advantage
of using an optimized hard eFPGA core). Tailoring eFPGAs in order
to set up a product line may then seem attractive. Unfortunately, customizing eFPGA
size and resources towards an application domain is likely to cause lengthy development
cycles, as each new instance of a hard eFPGA core must be silicon proven.
However, Kuon et al. [6] demonstrated the automation of circuit design, layout, and
verification to cut down the effort and time required to design a new embedded
hard FPGA core.
Second, eFPGAs are hard IP cores, whose integration is complex and time-consuming
and raises technology compliance issues, as all cores must be provided in the same
technology. As an example, the system-on-chip of [7] was a scalable system
infrastructure hosting heterogeneous reconfigurable accelerators, whose implementation
required migrating one of the accelerators to 90 nm, which resulted in six months of
extra work.
This is an incentive to move up a level of abstraction, based on soft macros that are
process-independent. Some work on designing Soft Programmable Logic Cores (SPLCs)
has been reported, as summarized in Sect. 2. This paper complements these previous
works by addressing some known issues in terms of scalability and timing closure.
The main contribution is ARGen, a generator of soft reconfigurable cores.
ARGen supports core customization and trades a minor overhead against accurate
timing closure. Also, the SPLCs come along with their programming environment. As a
result, the SPLCs' strengths (flexibility, just-fit dimensioning, performance
predictability) outweigh their disadvantages in terms of raw performance.
The remainder of this paper is organized as follows: Sect. 2 summarizes
related work on soft reconfigurable logic cores, Sect. 3 describes the structure
of the proposed SPLC, aiming to simplify both SPLC synthesis and system integration,
and Sect. 4 presents the exploitation tool flow and circuit timing analysis,
before Sect. 5 reports some results.
2 Background
Soft programmable logic cores (SPLCs) have been introduced in [5,8] to emphasize
flexibility and shorten development time, hence promoting agility. Unlike hard-core
eFPGAs, synthesizable SPLCs are delivered as RTL descriptions, and synthesizing
such cores is done with the usual tools (standard ASIC or FPGA flows).
Integrating SPLCs in a design is easy: a flat synthesis of designs with one
or many SPLCs requires no floorplanning.
Integrating SPLCs is safe: a whole design that contains SPLCs, can be ver-
ified, simulated and emulated without additional complexity.
Integrating SPLCs is a just-fit process: SPLCs can be easily customized
at the sole cost of updating the RTL description, with no need to silicon-
proof each modified instance again, so that domain space exploration may be
affordable.
Integrating SPLCs is reversible: the decision to use either a SPLC or fixed
logic to implement any subpart of a design remains reversible until just before
the chip goes to foundry. This decision stays on the designer who best knows
which subsystem may/will need later modifications, and how much flexibility
makes sense.
Integrating SPLCs supports optimization: the authors in [9] demonstrated
that the soft core area overhead can be reduced by 58% and the delay overhead
by 40% by creating custom standard cells (referred to as tactical cells) that are
better suited to reconfigurable architecture implementations, and by using
a tile-based approach to structure the layout of the hard macro.
As summarized above, SPLCs exhibit valuable features thanks to their RTL
nature; nevertheless, two difficulties emerge that prevent wide adoption.
First, the timing paths to explore are many. Second, the awareness of
physical timings is poor.
Unlike regular designs, SPLCs present an unusually large number of potential
timing paths and combinatorial loops, due to their reconfigurable nature. This
stresses the synthesis tool and may limit the size and nature of SPLCs [9]. To
address this problem, the authors in [5] propose to simplify the SPLC architecture
by removing programmable flip-flops and by allowing the signal flow to go in one
direction only, thus preventing combinatorial loops. As a consequence, those
SPLCs exclusively target combinatorial applications; the proposed architecture
is minimal, which restricts the complexity and nature of the applicative circuits
that can be implemented.
Moreover, performing timing analysis of a circuit mapped on an SPLC may
suffer from missing physical timing information. Exploiting SPLCs goes
through synthesizing applications on the reconfigurable cores. This relies on a
synthesis tool (further referred to as the virtual synthesis tool) that is independent of
the physical synthesis tool (the standard ASIC tool flow) used to implement the
SPLC itself. As an example, in [5,8], the virtual synthesis tool is VPR [10].
The virtual synthesis tool executes timing-driven placement and routing, as well
as timing analysis. These steps require the tool to be aware of every physical
delay of the SPLC resources. In [8], these physical delays are approximated using
the conceptual representation of the SPLC. This results in an inaccurate circuit
timing analysis, as adjacent resources in the conceptual SPLC representation
may actually be placed far apart on the silicon, thus distorting the delay
estimation. In [5,9], timing exceptions are set so that the physical ASIC tools
ignore the unused SPLC paths of the mapped circuit netlists when performing timing
analysis. This makes the delay measures of the mapped circuits' critical paths more
reliable; however, it comes at the cost of back-and-forth navigation between the
virtual and physical synthesis tools. Another option would be to extract accurate
information from the physical synthesis to feed the virtual tools. Yet, extracting
this information means collecting the elementary delays of all arbitrary sub-segments
of all combinatorial paths. This is of high complexity and must be redone for each new
SPLC physical synthesis. Besides, this can only be considered a preliminary step,
before the virtual synthesis tool actually exploits this information. As a consequence,
even though back-annotating the SPLC conceptual representation (used by the virtual
tool flow) with actual physical delays is considered in [5,11], it has never been
implemented in practice.
Our contribution goes one step further and lifts these limitations. In this
work, we propose a template for modifying SPLC architectures. This allows SPLC
integration as easy as reported in [5], but with no restriction on the SPLC
architectures, while providing easy and accurate timing analysis of mapped
circuits, solely using the virtual synthesis tool.
3 SPLC Design
Using an SPLC assumes three prerequisite steps: generating the SPLC architecture,
synthesizing this architecture to a physical target, and supporting system
integration. Once generated, the SPLC module becomes a library element that
can be instantiated within the application's RTL description; the whole
design is then synthesized using an ASIC flow. The portable RTL description of the
SPLC supports flat synthesis of the whole design without the need for specific
steps such as floorplanning.
Synthesizing and deploying applications onto the SPLC then involves a dedicated
software environment. This tool is independent of the physical technology,
which in turn may require specific software development, as detailed in Sect. 4.
Figure 1 shows how these two flows, which together contribute to making
the SPLC a credible solution, relate and interact. The ARGen tool covers two
aspects, as detailed in the next section: architecture generation and bitstream
production.
The SPLC has Width × Height CLBs, each of which has I inputs and N
outputs. A CLB is composed of N BLEs (Basic Logic Elements). A BLE has
one LUT with K inputs and one register that can be bypassed (the application
register). The inputs of the BLEs are derived from a global crossbar with I + N inputs
(the I CLB inputs plus the N feedback signals from the BLE outputs). Each routing
channel contains W unidirectional wires in both directions, which can be
connected to other wires from adjacent routing channels, depending on how the
Switch Blocks (SBs) are configured. Connections are implemented as multiplexers
that are controlled through their select signal(s) coming from the configuration
layer (as illustrated in Fig. 3).
The ARGen approach isolates the SPLC conceptual representation from its
physical implementation on silicon. The proposed solution is to inject extra registers
within the SPLC to latch the output of every configurable multiplexer that
connects routing wire tracks. These registers are referred to as Virtual Time
Propagation Registers (VTPRs). VTPRs break down physical logic chains into
short segments and prevent any combinatorial loop from appearing in the physical
SPLC implementation, whatever its configuration. VTPRs are transparent
for circuits mapped on the SPLC and do not appear in the SPLC conceptual
model.
VTPRs exhibit two decisive advantages. First, using VTPRs in an SPLC architecture
alleviates the task of the physical synthesizer, as VTPRs shorten the timing
paths in the SPLC architecture and prevent combinatorial loops. This promotes the
scalability of architectures: there is no longer a need to limit the size and complexity of
the synthesized architectures, nor to restrict the signal flow to one direction. This,
however, raises the need for an extra, faster clock (ClkVTPR) to allow signal
propagation through the VTPRs within one applicative clock cycle. Second, VTPRs
favor timing closure, as reported in Sect. 4.2. VTPRs bring no improvement
in terms of performance of the synthesized SPLC; in that respect, VTPRs differ from
C-slowing [12], which can be combined with retiming for the sake of throughput
increase.
Fig. 3. Implementation of a Switch Box (left) and a BLE (right), with their associated
configuration registers from the configuration layer.
input config_in vector and an input config_valid bit. The number of configuration
shift registers, which determines the size of the config_in vector, is chosen
to fit the designer's needs (the wider the interface, the faster the configuration, but the
more area it consumes). The configuration controller can read the SPLC configuration
bitstream from an internal memory, be mapped on a bus in the case of an
SoC, or even be accessible from outside the chip through the pinout.
4 SPLC Exploitation
When deploying applications onto SPLCs, no commercial tool fits the architecture,
but some open-source academic works have been reported that offer a
customizable solution for application synthesis. The ARGen approach relies on
existing third-party tools, while offering fast and accurate timing closure as
a strong contribution. To this end, in addition to the RTL code, the ARGen tool
also generates VPR-specific architecture description files. Additionally, ARGen
generates the bitstream and executes the timing analysis.
When synthesizing an SPLC, the timing reports indicate the maximum frequency Fmax at
which the design may operate. Fmax depends on the worst-case propagation
delay of the SPLC atomic resources isolated between two VTPRs.
The virtual synthesis flow only relies on Fmax to perform timing analysis. At
the netlist level, assuming a net NC connects two logic nodes LA and LB, the
delay of the mapped net NC can be computed as the number of VTPRs along
the mapped path from LA to LB.
Adding VTPRs requires operating two clocks: ClkVTPR, the VTPR clock,
and Clkapp, clocking the application registers. To ensure that the mapped circuit
runs properly on the SPLC, ClkVTPR and Clkapp must abide by the relation in Eq. (1), where

$$N_{VTPR} = \max_{\forall\ \text{hypernet}\ N_c} \left( \max_{N_{c_i} \in \text{subnets}(N_c)} \left( \text{length}(N_{c_i}) \right) \right) \qquad (2)$$

In Eq. (2), the netlist is seen as a set of hypernets. These multi-terminal nets are
spread into a collection of mono-terminal nets, each of which starts from and reaches
either an I/O or a register.
This greatly simplifies and speeds up the timing computation, especially as the
Manhattan distance is a smart approximation of length. Eq. (2) then profitably
replaces the Elmore delay computation [16].
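The following Python sketch illustrates this timing model; the N_VTPR computation follows Eq. (2), while the clock constraint is only a hedged reading of the (not reproduced) Eq. (1), assuming one Clkapp period must span N_VTPR + 1 ClkVTPR cycles:

def n_vtpr(hypernets):
    # Eq. (2): worst mono-terminal sub-net length (in VTPR hops) over all hypernets;
    # the Manhattan distance between the endpoints approximates each length.
    return max(max(lengths) for lengths in hypernets.values())

def max_app_frequency(f_vtpr, hypernets):
    # hedged reading of Eq. (1): f_app <= f_VTPR / (N_VTPR + 1)
    return f_vtpr / (n_vtpr(hypernets) + 1)

# sub-net lengths per hypernet of a mapped netlist (toy values)
nets = {"n1": [3, 2], "n2": [5], "n3": [1, 4, 2]}
print(n_vtpr(nets))                    # 5
print(max_app_frequency(400e6, nets))  # admissible Clkapp for a 400 MHz ClkVTPR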
5 Experiments
The experiments rely on exploring the implementation cost of a parametric
SPLC. Then, the use of this SPLC is demonstrated on a regular expression
matching application.
Fig. 7. VTPRs make the synthesis time affordable, hence promote scalability.
Fig. 8. VTPRs lead to a 3% average overhead in terms of area.
Figure 7 shows that the synthesis time ranges from 2.08 to 4.85 s per BLE,
with 2.89 s/BLE on average, when using VTPRs. Without VTPRs, this average rises to
10.18 s/BLE (ranging from 2.17 to 24.28). Besides, the standard deviation
is reduced from 8.22 to 0.97 when introducing VTPRs. The lessons learned
are two: first, VTPRs save synthesis time; second, VTPRs make the synthesis time
predictable. Figure 8 shows that VTPRs come almost for free in terms of area
(around 2% for bigger SPLCs). Also, as virtual prototyping usually relies on
FPGAs as an experimental platform, Table 1 reports results when implementing
SPLCs, with and without VTPRs, on top of Xilinx FPGAs. Three stages are
reported: XST (RTL synthesizer), MAP and PAR (logic synthesizer, placer and
router), and TRCE (timing analyzer).
VTPRs do not significantly impact the XST synthesis time. In contrast, MAP
and PAR show unpredictable execution times unless VTPRs are used. This comes
from the heuristics within these tools; in particular, the combinational loops
within the SPLCs are broken down into smaller netlists non-deterministically.
TRCE seems to scale with regard to the number of BLEs. Again, the synthesis time is
shorter and more predictable when using VTPRs, which preserves FPGAs
as a potential virtual prototyping platform when designing VTPR-aware SPLCs.
Dimensions        XST time              MAP time              PAR time                 TRCE time             Total synthesis time
Size     BLEs     Raw    VTPRs  Ratio   Raw    VTPRs  Ratio   Raw     VTPRs  Ratio     Raw    VTPRs  Ratio   Raw     VTPRs  Ratio
2×2      16       32     32     1.00    133    120    1.11    103     66     1.56      35     33     1.06    303     251    1.21
4×4      64       68     61     1.11    228    238    0.96    344     138    2.49      71     39     1.82    711     476    1.49
6×6      144      159    146    1.09    530    298    1.78    60538   155    390.57    219    49     4.47    61446   648    94.82
8×8      256      328    354    0.93    909    524    1.73    3088    220    14.04     493    63     7.82    4818    1161   4.15
10×10    400      661    715    0.92    1842   763    2.41    4940    350    14.11     1208   84     14.38   8651    1912   4.52
12×12    576      1296   1371   0.94    4838   1179   4.10    39856   474    84.08     3289   105    31.32   49279   3129   15.75
14×14    784      2343   2487   0.94    4613   3904   1.18    18617   654    28.47     5007   154    32.51   30580   7199   4.24
5.2 Usage
template assumes an initial memory continuously streams data (one byte per
cycle) to the generated design, whose role is to detect a match with a reference
pattern. The detection scheme relies on a non-deterministic finite automaton
(NFA) [17] to alleviate the need for backtracking (thanks to its multiple active
states). Table 2 illustrates the implementation cost of representative expressions
in terms of flip-flops and LUTs in the SPLC. The number of flip-flops only
depends on the pattern size, while the number of LUTs depends on the pattern
complexity. The first five expressions score the cost of the |, ?, + and ∗ constructs.
The last two illustrate real cases. The link expression looks for hyperlinks with a
known root; the full expression is: /<a\s+href="/courses/[^ "]*"[^ >]*>/.
The ssh expression is of higher complexity and corresponds to searching for ssh traces
in a log file; the full expression is: /[^ ]+ +\d+ \d+:\d+:\d+ [^ ]+ sshd\[\d+\]:
Accepted (password | publickey) for [^ ]+ from \d+\.\d+\.\d+\.\d+
port \d + ssh/.
Regex            SPLC FF   SPLC LUT   SPLC BLE   min NVTPR   min size   min W
/abcdefgh/       8         12         12         10          2×2        4
/abcd|efgh/      8         15         15         12          2×2        4
/a(bcdefg)?h/    8         13         13         12          2×2        4
/a(bcdefg)+h/    8         14         14         10          2×2        8
/a(bcdefg)*h/    8         16         16         10          2×2        6
Link             23        44         44         14          4×4        12
ssh              76        99         100        18          6×6        12
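For illustration, the following Python sketch mimics the backtracking-free NFA matching scheme for a plain literal pattern such as /abcdefgh/; it is a software analogue of the one-hot state registers mapped onto the SPLC, not part of the ARGen tool flow:

def nfa_match_stream(pattern, stream):
    # One state per pattern position; every incoming byte advances all active
    # states in parallel, so no backtracking is ever needed.
    n = len(pattern)
    active = set()
    for byte in stream:
        nxt = {0} if byte == pattern[0] else set()
        for s in active:
            if s + 1 < n and byte == pattern[s + 1]:
                nxt.add(s + 1)
        if n - 1 in nxt:
            return True            # match signal raised
        active = nxt
    return False

print(nfa_match_stream(b"abcdefgh", b"xxxabcdefghyyy"))   # True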
6 Conclusion
without affecting the ASIC design flow. However, timing analysis of circuits
running on SPLCs usually turns out to be inaccurate.
Our contribution tackles this issue by providing SPLCs decorated with
VTPRs. VTPRs are extra registers that break down the loops in the interconnect
in order to master the timings in the SPLC. This offers simplified timing
closure (predictable and accurate timings). Besides, VTPRs ensure scalability
when synthesizing the SPLC. Also, VTPRs make sense as an affordable feature,
coming at the sole cost of a 3% area overhead on average.
Finally, this approach has been demonstrated by implementing regex
detection. This use case illustrates how SPLCs can support changing protocols.
This work also relates closely to overlays, which are usually virtual coarse-grained
architectures laid on top of fine-grained FPGA devices for the sake
of improved productivity, portability, debugging capabilities, etc. ARGen has
been demonstrated to suit the designer's needs when addressing overlays. Future work will
investigate how combining SPLCs and overlays can drive new improvements.
Acknowledgement. This work has been supported by the French National Research
Agency under the contracts ANR-11-INSE-015 (ARDyT) and ANR-A0-AIRT-07
(B-Com).
References
1. Menta - embedded Programmable Logic. http://www.menta-efpga.com
2. Nanoxplore. http://www.nanoxplore.com
3. ADICSYS - eFPGA (embedded FPGA) IP. http://www.adicsys.com
4. Abramovici, M., Bradley, P., Dwarakanath, K.N., Levin, P., Memmi, G., Miller, D.:
In: Sentovich, E. (ed.) Proceedings of DAC 2006, pp. 7–12. ACM (2006)
5. Wilton, S.J., Kafafi, N., Wu, J.C., Bozman, K.A., Aken’Ova, V.O., Saleh, R.:
Design considerations for soft embedded programmable logic cores. IEEE J. Solid-
State Circ. 40(2), 485–497 (2005)
6. Kuon, I., Egier, A., Rose, J.: Design, layout and verification of an FPGA using
automated tools. In: Schmit, H., Wilton, S.J.E. (eds.) FPGA 2005, pp. 215–226.
ACM (2005). http://doi.acm.org/10.1145/1046192.1046220
7. Voros, N., Rosti, A., Hübner, M. (eds.): Dynamic System Reconfiguration in Het-
erogeneous Platforms. LNEE, vol. 40. Springer, Heidelberg (2009)
8. Kafafi, N., Bozman, K., Wilton, S.J.: Architectures and algorithms for syn-
thesizable embedded programmable logic cores. In: Proceedings of the 2003
ACM/SIGDA Eleventh International Symposium on Field Programmable Gate
Arrays, pp. 3–11. ACM (2003)
9. Ova, V.A., Lemieux, G., Saleh, R.: An improved “soft” eFPGA design and imple-
mentation strategy. In: Proceedings of the IEEE 2005 Custom Integrated Circuits
Conference, CICC 2005, pp. 179–182. IEEE (2005). http://dx.doi.org/10.1109/
CICC.2005.1568636
10. Betz, V., Rose, J.: VPR: a new packing, placement and routing tool for FPGA
research. In: Luk, W., Cheung, P.Y.K., Glesner, M. (eds.) FPL 1997. LNCS, vol.
1304, pp. 213–222. Springer, Heidelberg (1997). doi:10.1007/3-540-63465-7 226
11. Wiersema, T., Bockhorn, A., Platzner, M.: Embedding FPGA overlays into config-
urable systems-on-chip: ReconOS meets ZUMA. In: 2014 International Conference
on ReConFigurable Computing and FPGAs (ReConFig), pp. 1–6, December 2014
12. Leiserson, C.E., Saxe, J.B.: Retiming synchronous circuitry. Algorithmica 6, 5–35
(1991)
13. University of California, Berkeley: Berkeley Logic Interchange Format (BLIF) (1992).
http://vlsi.colorado.edu/~vis/blif.ps
14. Jamieson, P., Kent, K.B., Gharibian, F., Shannon, L.: Odin 2 - an open-source
verilog hdl synthesis tool for CAD research. In: FCCM 2010 (2010)
15. Brayton, R., Mishchenko, A.: ABC: an academic industrial-strength verification
tool. In: Touili, T., Cook, B., Jackson, P. (eds.) CAV 2010. LNCS, vol. 6174, pp.
24–40. Springer, Heidelberg (2010). doi:10.1007/978-3-642-14295-6 5
16. Elmore, W.C.: The transient response of damped linear networks with particular
regard to wideband amplifiers. J. Appl. Phys. 19(1), 55–63 (1948)
17. Sidhu, R., Prasanna, V.K.: Fast regular expression matching using FPGAs. In:
FCCM, ser. FCCM 2001, pp. 227–238. IEEE Computer Society (2001)
FPGA Debugging with MATLAB Using
a Rule-Based Inference System
1 Introduction
However, testing and debugging have associated time and cost implications. Further-
more, it is difficult to gather debugging data for complex designs when the amount of
data is large and rapidly changing. However, if the simulation data can be linked to
verification through a software environment, the testing and debugging process
becomes much easier.
In this paper, a new methodology is introduced that addresses the visibility and
limited window size issues. The paper presents a methodology to ease the debugging
process with the help of a visual debugging tool implemented in MATLAB, hence
using the power of MATLAB to debug a system. We have developed a new verification
method based on hardware debugging, using MATLAB as a tool and a rule-based
inference system as the verification method for the hardware design. We use a
Gaussian-filter-based image processing system as a case study to illustrate the
proposed verification method. In our verification system, a golden reference (GR) is
utilized, which can be defined using the rule-based inference system or be user-defined. The
goal is to find bugs without the need to run the system intermittently, debugging the
complete window at one time by utilizing the power of the MATLAB-based debugging
system, which in turn reduces the debugging time and hence the overall design cycle.
The rest of the paper is organized as follows. Section 2 presents related work and
provides background information. Section 3 discusses the debugging by DSAS
approach with MATLAB using a rule-based inference system. In Sect. 4 the results are
discussed. The paper is concluded in Sect. 5.
2 Related Work
monitored only after the core has been triggered, and even after the trigger, only a limited
debug window is available. Debugging with a small amount of sample data becomes
cumbersome. Furthermore, debugging is done with HDL simulators, which adds cost
and requires human intervention. Sometimes external logic analyzers can also be
used along with the ILA cores to enhance the debugging capabilities, but the solution
does not remain cost-effective in such cases [8].
But as obvious from Fig. 3, the design process has to start from MATLAB.
However, if the algorithm is difficult or entirely impossible to implement in MATLAB,
the process of hardware generation cannot start.
In this section, a new methodology for debugging is presented. In the scope of this
work, a processor-based debugging system is utilized (an ARM in the case of a Xilinx Zynq
device, or a Microblaze for the rest of the Xilinx FPGA families) to collect the data from
on-board trace buffers (DSAS approach) [20]. Once the trace buffers are full, the DUT is stopped
by the clock manager and the data is transferred to the terminal through Ethernet.
The saved data is used by the MATLAB-based software debugging system, utilizing a
rule-based inference system approach. A block diagram of the hardware-software
co-debugging methodology is shown in Fig. 5.
Normally, a debug system can only show the monitored data (limited to the window
size), and the decision-making process is left to the user. But if the debugging
system has an unlimited window, as promised by DSAS, monitoring millions of samples
manually may be tiresome. This necessitates the use of verification software for
debugging. However, the main bottleneck for FPGA-based designs in using verification-by-software
methods is the data transfer rate limitation: the designs operate at very high
frequencies, and the data transfer between the FPGA and the verification software cannot
be as fast as the FPGA operating frequency. Adopting the DSAS approach resolves this
issue because the DUT is stopped during the data transfer from the FPGA to the terminal.
Hence, a rule-based inference system utilizing the power of MATLAB can be used very
efficiently along with the DSAS approach, and it can not only monitor the output but also
decide about the qualification of the system.
The main benefits of this technique are no loss of debugging data thanks to an
unlimited debug window, no use of HDL simulators for waveform viewing, and a shorter
debugging time by using verification by software.
3.1.2 The second DUT is a CORDIC core [22] used with a Microblaze soft pro-
cessor. Microblaze reads data from a file and then sends the data to the CORDIC core.
Different mathematical operations were performed by the CORDIC core before the data
is sent back to the Microblaze (Fig. 7).
3.2 Interfacing
The debugging system hardware is connected to the terminal through Ethernet using
the UDP protocol [23]. Once the debugging data is received on the terminal platform, it is
used by MATLAB for debugging. In order to control and streamline the whole debugging
process, a graphical user interface has been developed using
MATLAB GUIDE [24]. The GUI front panel is shown in Fig. 8.
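A minimal Python sketch of the terminal side of this UDP link is shown below (the port number, packet size, and file name are illustrative assumptions; the actual receiver feeds the MATLAB environment):

import socket

def receive_debug_data(port=5005, bufsize=1472, out_file="trace_dump.bin", timeout=2.0):
    # Collect trace-buffer packets sent by the FPGA and store them for later
    # processing; reception stops once the link stays silent for `timeout` seconds.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    sock.settimeout(timeout)
    with open(out_file, "wb") as f:
        try:
            while True:
                data, _addr = sock.recvfrom(bufsize)
                f.write(data)
        except socket.timeout:
            pass
    sock.close()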
output is expected, it is most appropriate for the user to provide the dataset for pattern
matching. Based upon the set of rules, the inference engine calculates the similarity
between the debugging data and the dataset from the knowledge base. Depending upon
the DUT type, regression (multiple linear regression), correlation (linear or rank
correlation), or cross-correlation can be selected as a rule. For the current research
work, the inference engine calculates the cross-correlation (rule) between the debugging
data and the user-defined dataset and displays the result. A cross-correlation of 1.0 depicts a
match. In cases where the user does not know the output, it is also possible to make the
debugging system learn from any available identical system. If neither of the two data
options is available, the debugging system can mine for one in the database (if a
similar system was debugged in the past and its data saved to the knowledge base). If
no relevant dataset is found, the debugging data is displayed without any overlay
or rule application. In such cases, the debugging data is saved in the database for future
reference, provided it has a unique nomenclature.
Once the relevant data has been loaded into the inference engine, the engine calculates
the cross-correlation between the debugging data and the database. The result of the
cross-correlation function is an array of values showing the similarity between the
debugging data and the database. The maximum cross-correlation value is achieved when
the two datasets match perfectly. Using MATLAB functions, the lag between the two datasets
can also be found. A correlation of 0.0 depicts no match. A cross-correlation of 1.0 depicts a
perfect match, meaning that one dataset can be derived from the other either directly or
using a positive scale factor. A correlation of −1.0 depicts maximum negative correlation
(one dataset can be derived from the other using a negative scale factor). Values
between 0 and 1 show a partial match. A correlation value above 0.90 may indicate very
good similarity between the two datasets [25] (depending upon the use case), but for
debugging purposes a perfect match is required. The inference engine can also indicate
the best-match instance, which can be used as a starting point for debugging. Hence, by
using the rule-based expert system, debugging becomes easier and a lot of time is saved.
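A minimal Python/NumPy sketch of this cross-correlation rule is given below (the normalization details are our assumptions; the actual implementation relies on MATLAB functions):

import numpy as np

def best_match(debug_data, reference):
    # Normalize both traces, compute their cross-correlation and report the
    # peak value and its lag; a peak of 1.0 at lag 0 indicates a perfect match.
    a = (debug_data - np.mean(debug_data)) / (np.std(debug_data) * len(debug_data))
    b = (reference - np.mean(reference)) / np.std(reference)
    xcorr = np.correlate(a, b, mode="full")
    peak = int(np.argmax(xcorr))
    lag = peak - (len(reference) - 1)
    return xcorr[peak], lag

trace = np.sin(np.linspace(0, 20, 2000))
print(best_match(trace, trace))               # (~1.0, 0): perfect match
print(best_match(np.roll(trace, 37), trace))  # lower peak and/or non-zero lag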
The main advantage of this debugging methodology is that, unlike limited-window
debugging systems, the DSAS approach can handle extremely large datasets. It can
monitor 16 signals simultaneously (for the current research work, but it is not limited
to 16), with each signal having millions of points; comparing such a large number of
transitions manually can become cumbersome, because each transition needs to be checked
against the corresponding clock cycle (sample number in this case). But by adopting the
rule-based inference system methodology, debugging becomes easy once the knowledge
base has been populated with appropriate data, because the system carries out the
cross-correlation (or any other appropriate rule) and displays the results. Furthermore, the
debugging system plots the debugging data with the relevant data overlay to ease the
debugging process.
4 Results
The proposed debugging approach has been tested with 2 different designs: An image
processing application and a Microblaze-based CORDIC application. MATLAB plot
of the image processing design without inference system application is shown in
114 H.H. Khan and D. Göhringer
Fig. 10. The design was operated at 100 MHz. Hence each sample corresponds to a
clock cycle of 10 ns. Input (pixel in) is shown in first subplot. After processing the
input data, corresponding Img out is shown in the second subplot. In third subplot, the
Valid out remains zero initially because the window generator needs to be filled before
valid data can be acquired at the output. As can be seen in the third subplot, Valid data
turns to 1 after (6w + 6) i.e. 6006 samples (where w is the image width) indicating that
the filter has a valid output. The data remains valid for (w − 6) i.e. 994 samples and
then again becomes invalid for 6 samples (kernel size −1). This pattern continues for
the whole length of the image. If the design is required to be reset, the reset needs to be
transitioned to 1. In order to keep the design enabled during debugging, the enable
should be 1. It can be noticed that more than 135,000 samples of each signal has been
acquired. (5 signals are shown in the figure however 16 signals were monitored for the
current research work).
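A hedged sketch of how the expected Valid_out template could be generated for the knowledge base, using the figures quoted above (w = 1000, kernel size 7; the total sample count is illustrative):

def expected_valid_out(width=1000, kernel=7, total_samples=135_000):
    # zero for 6*width + 6 samples, then repeatedly 1 for width - 6 samples
    # and 0 for kernel - 1 samples, for the whole length of the trace
    out = [0] * (6 * width + 6)
    while len(out) < total_samples:
        out += [1] * (width - 6) + [0] * (kernel - 1)
    return out[:total_samples]

template = expected_valid_out()
print(template.index(1), sum(template))   # first valid sample (6006) and number of valid samples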
In Fig. 11, the debugging data has been plotted alongside the dataset from the knowledge
base. MATLAB greatly facilitates the mathematical modelling of any system. However, in
case a mathematical model is not available or modelling is time-consuming, data from
any similar design can suffice for knowledge base generation. For the current research
work, the knowledge base has been populated with data acquired by learning from
a similar system. However, if a similar system is not present, the user can input his own
template for populating the knowledge base, because the expected output is generally
known.
Furthermore, if the knowledge base is devoid of any template, the option for
manual debugging is still available, in contrast to other verification-by-software debugging
methodologies, where debugging is not possible in the absence of the GR model. When the
user is satisfied with the output, the database can be populated for future use.
Once the knowledge base has been populated with the corresponding dataset, the inference
engine calculates the cross-correlation between the debugging data and the knowledge base
dataset and displays the results. A plot of the output of the rule-based inference system is
shown in Fig. 12.
As can be seen in Fig. 12, the maximum cross-correlation between the two datasets
is 1.0 for the plotted data and the lag between the two datasets is zero; hence, a perfect
match exists between the debugging data and the knowledge base. However, if the
maximum cross-correlation is less than 1.0, which indicates some disparity between the
debugging data and the knowledge base dataset, analyzing the data becomes important.
In such cases, the lag at the maximum value can be used as a starting point for
debugging. Manual comparison of such large datasets would have been time-consuming,
but the rule-based inference system makes the debugging process fast and
efficient.
5 Conclusions
References
1. Hung, E., Wilton, S.J.: Towards simulator-like observability for FPGAs: a virtual overlay
network for trace-buffers. In: Proceedings of the ACM/SIGDA International Symposium on
Field Programmable Gate Arrays (2013)
2. Asaad, S., Bellofatto, R., Brezzo, B., Haymes, C., Kapur, M., Parker, B., Roewer, T.,
Saha, P., Takken, T., Tierno, J.: A cycle-accurate, cycle-reproducible multi-FPGA system
for accelerating multi-core processor simulation. In: Proceedings of the ACM/SIGDA
International Symposium on Field Programmable Gate Arrays (2012)
3. Herrmann, A., Nugent, G.P.: Embedded logic analyzer for a programmable logic device. U.
S. Patent No. 6,389,558, May 2002
4. Altera Inc.: On-chip design verification with Xilinx FPGAs, Agilent Application Note 1456,
April 2003
5. Arshak, K., Jafer, E., Ibala, C.: Testing FPGA based digital system using XILINX
ChipScope logic analyzer. In: IEEE 29th International Spring Seminar on Electronics
Technology (2006)
6. Kuijsten, H.: Method and apparatus for a trace buffer in an emulation system. U.S. Patent
No. 5,680,583, 21 October 1997
7. Woodward, J.: In-circuit debug of FPGAs. CMP Media LLC N. Y. Embed. Syst. Eur. 7(49),
16–17 (2003)
8. Agilent Technologies Inc.: Deep storage with Xilinx ChipScope Pro and Agilent
Technologies FPGA Trace Port Analyzer. Agilent Product Overview 5988-7352EN,
February 2003
1 Introduction
Fig. 1. Points of modification in the FPGA development flow [6] (Color figure online)
reduced as well. This makes the device more prone to soft errors [4]. For FPGA-based
systems, simulation- and emulation-based methods are usually applied for the
testing, verification, and validation of designs. Therefore, testing and the dependability
analysis of such systems are crucial. These procedures require the deliberate
introduction of faults into target systems. The fault injection technique plays an
important role in the dependability evaluation and is a widely accepted solution
to perform SEU sensitivity analysis [2,5]. In the FPGA-based fault injection
process, there are different points in the design flow where faults can be
injected, as shown by the blue dashed lines in Fig. 1. Various tools have been devised
over the past several years to inject faults in FPGA-based designs at various
locations for evaluating design characteristics. FPGA-based fault injection tools
have the advantages of both physical and simulation-based techniques, such as speed
and flexibility. There are two main groups of techniques: reconfiguration-based
and instrumentation-based [9]. Fault injection tools that work on the netlist
obtained after the synthesis process are introduced in [10–12], and those
based on the reconfiguration technique are presented in [1,5,10,15]. Additionally,
there are some tools based on the instrumentation technique [13,17], and on
hybrid (simulation/emulation) techniques [7,14,16].
In this work, an instrumentation-based fault injection methodology is developed
in Matlab, and an experimental approach is also proposed, which identifies
the most sensitive parts of the design for different fault models (e.g. bit-flip and
stuck-at 1/0).
Contributions
// Fault injection structure for the 2-to-1 multiplexer example (the module
// header and full register declaration are reconstructed for readability;
// the module name and port directions are inferred, not taken from the paper).
module mux2to1_faulty (input a, b, sel, fis, input [2:0] select, output dout_f1);

reg f0, f1, f2, f3, f4, f5, f6;            // one fault-enable signal per location

// DeMux-based FISA unit: routes the fault signal fis to the selected location
always @(*) begin
  if (select == 3'd0) begin
    f0=fis; f1=0; f2=0; f3=0; f4=0; f5=0; f6=0; end
  else if (select == 3'd1) begin
    f0=0; f1=fis; f2=0; f3=0; f4=0; f5=0; f6=0; end
  // ... (select values 3'd2 to 3'd5 handled analogously)
  else if (select == 3'd6) begin
    f0=0; f1=0; f2=0; f3=0; f4=0; f5=0; f6=fis; end
  else begin
    f0=0; f1=0; f2=0; f3=0; f4=0; f5=0; f6=0; end
end

// Fault injection in the gate instances: each net is XORed with its enable,
// so asserting fis flips the value at the selected location (bit-flip model).
wire m, n, o;
and u0 (n, f0^m, f1^a);
and u1 (o, f2^b, f3^sel);
not u2 (m, f4^sel);
or  u3 (dout_f1, f5^n, f6^o);
endmodule
A sensitive location is a location in the SUT where the occurrence of any type
of fault results in a failure. The sensitive locations of the SUT are obtained
using the following proposed experimental approach. According to this approach,
these locations are more or less equally sensitive to bit-flip and stuck-at (1/0)
faults. Some definitions must be considered in order to understand the proposed
approach.
4 RASP-FIT Tool
The RASP-FIT tool is designed for testing, fault detection and dependability
analysis of FPGA-based systems. It stands for “RechnerArchitektur und
SystemProgrammierung-Fault Injection Tool”. It is developed in Matlab using its
GUI environment. In general, the fault injection method should be highly effective
for validating and demonstrating the design characteristics and robustness in the
presence of faults [18]. For ease of use, a standalone Matlab GUI is developed for
the proposed tool using the deploytool command. The
complete flow chart of the proposed tool is shown in Fig. 4.
[Fig. 4 flow-chart residue: Open RASP-FIT Tool → Welcome Screen → # of Copies of SUT → Report Generated.]
The RASP-FIT tool accepts a Verilog (*.v) file and injects bit-flip and stuck-at
1/0 faults at all possible locations in the SUT. The generated files contain the code
for the original and the faulty copies separately. Table 1 describes the results of the
fault injection algorithm applied to various SUTs from the ISCAS’85 and ISCAS’89
circuits for the bit-flip and stuck-at 1/0 fault models. These benchmark circuits
are widely used for different purposes, e.g. testing and fault injection analysis.
[Fig. 5 block diagram: the Verilog design file (SUT) is passed to the fault injection algorithm, which produces the faulty copies and a top design file; these are simulated with Xilinx ISE and ModelSim for hardness analysis, yielding the sensitive locations and, after compaction, the compacted test vectors.]
[Fig. 6 bar chart: number of sensitive locations in the systems under test (c17, c432, c499, c880) for hardness threshold levels of 30%, 50%, 70% and 90%.]
The number of select pins required for select inputs is also shown in the table.
In the previously proposed ATPG test method, faults were injected using this
tool, and it was presented in our work [8].
The RASP-FIT tool is developed in Matlab, while the simulation environment is
created using the Xilinx ISE and ModelSim tools, as shown in Fig. 5. Combinational
digital systems are considered for hardness analysis in this paper. If the hardness
of a fault is 100%, the fault is not detectable for any input; hence, it is called an
untestable or undetectable fault. Conversely, a hardness of 0% indicates that the
fault is detected by all test vectors, which means that the portion of the circuit
where the fault has occurred is very sensitive to fault attacks.
We consider four threshold levels and find out the sensitive locations for
each. Using these threshold values, we can obtain the most sensitive locations.
Table 2 shows various threshold levels and their respective numbers of sensitive
locations. These locations are obtained from the hardness matrix by comparing
its value for each fault model with a particular threshold value. We have used
four different threshold values to obtain different numbers of sensitive locations.
This information will be used in the development of the redundancy technique in the
next phase. These locations are obtained in a row vector with the corresponding
specific fault numbers. Figure 6 shows the graphical illustration of the results
provided in Table 2.
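To make the hardness-based selection concrete, the following C++ sketch (our own illustration; the RASP-FIT tool itself is implemented in Matlab) derives the per-fault hardness from a fault-detection matrix and compares it against a threshold. The data layout and the convention that a location counts as sensitive when its hardness falls at or below the threshold are assumptions made for this example.

#include <cstddef>
#include <vector>

// detected[f][v] is true if fault f is detected when test vector v is applied.
// Hardness of a fault = percentage of test vectors that do NOT detect it:
// 100% means an untestable/undetectable fault, 0% means the fault is detected
// by every vector, i.e. the location is highly sensitive.
std::vector<std::size_t> sensitiveLocations(
        const std::vector<std::vector<bool>>& detected, double thresholdPercent) {
    std::vector<std::size_t> sensitive;
    for (std::size_t f = 0; f < detected.size(); ++f) {
        std::size_t undetected = 0;
        for (bool hit : detected[f])
            if (!hit) ++undetected;
        const double hardness = 100.0 * undetected / detected[f].size();
        if (hardness <= thresholdPercent)   // low hardness -> sensitive location
            sensitive.push_back(f);
    }
    return sensitive;
}

Evaluating this for thresholds of 30%, 50%, 70% and 90% would yield per-threshold counts of the kind summarised in Table 2.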
6 Conclusion
In this paper, some methodologies used in the development of the RASP-FIT
tool have been presented, which include the fault injection algorithm and the
method for finding sensitive locations in FPGA-based designs. In this work,
the proposed fault injection algorithm has been validated on the Verilog gate
level designs for combinational ISCAS’85 and ISCAS’89 circuits. Also, the hard-
ness analysis method has been presented for combinational ISCAS’85 benchmark
circuits.
In the future, the fault injection algorithm will be developed for other abstrac-
tion levels and SoCs. Also, the hardness analysis method will be applied to the
sequential and microprocessor designs. Currently, the validation of this proposed
method and the redundant approach are in progress for these sensitive locations,
making the design more robust and fault-tolerant, without major area overhead
and power consumption.
References
1. Alderighi, M., Casini, F., D’Angelo, S., Mancini, M., Codinachs, D.M., Pastore, S.,
Poivey, C., Sechi, G.R., Weigand, G.S.R.: Experimental validation of fault injection
analyses by the FLIPPER tool. In: 2009 European Conference on Radiation and Its
Effects on Components and Systems (RADECS), Burges, Belgium, pp. 544–548,
September 2009
2. Alexandrescu, D., Sterpone, L., López-Ongil, C.: Fault injection and fault tolerance
methodologies for assessing device robustness and mitigating against ionizing radi-
ation. In: 2014 19th IEEE European Test Symposium (ETS), Paderborn, Germany,
pp. 1–6, May 2014
3. Corradi, G., Girardey, R., Becker, J.: Xilinx tools facilitate development of FPGA
applications for IEC61508. In: 2012 NASA/ESA Conference on Adaptive Hardware
and Systems (AHS), Erlangen, Germany, pp. 54–61, June 2012
4. Desogus, M., Sterpone, L., Codinachs, D.M.: Validation of a tool for estimating
the effects of soft-errors on modern SRAM-based FPGAs. In: 2014 IEEE 20th
International On-Line Testing Symposium (IOLTS), Platja d’Aro, Girona, Spain,
pp. 111–115, July 2014
5. Gosheblagh, R.O., Mohammadi, K.: Dynamic partial based single event upset
(SEU) injection platform on FPGA. Int. J. Comput. Appl. (0975–8887) 76(3),
19–24 (2013)
6. Graham, P., Nelson, B., Hutchings, B.: Instrumenting bitstreams for debugging
FPGA circuits. In: The 9th Annual IEEE Symposium on Field-Programmable
Custom Computing Machines, FCCM 2001, Rohnert Park, CA, USA, pp. 41–50,
March 2001
7. Jeitler, M., Delvai, M., Reichor, S.: FuSE - a hardware accelerated HDL fault
injection tool. In: 5th Southern Conference on Programmable Logic, SPL, Sao
Carlos, Brazil, pp. 89–94, April 2009
8. Khatri, A.R., Hayek, A., Börcsök, J.: ATPG method with a hybrid compaction
technique for combinational digital systems. In: 2016 SAI Computing Conference
(SAI), London, UK, pp. 924–930, July 2016
9. Khatri, A.R., Milde, M., Hayek, A., Börcsök, J.: Instrumentation technique for
FPGA based fault injection tool. In: 5th International Conference on Design and
Product Development (ICDPD 2014), Istanbul, Turkey, pp. 68–74, December 2014
10. Mansour, W., Aguirre, M.A., Guzmán-Miranda, H., Barrientos, J., Velazco, R.:
Two complementary approaches for studying the effects of SEUs on HDL-based
designs. In: 2014 IEEE 20th International On-Line Testing Symposium (IOLTS),
Platja d’Aro, Catalunya, Spain, pp. 220–221, July 2014
11. Mansour, W., Velazco, R.: An automated SEU fault-injection method and tool for
HDL-based designs. IEEE Trans. Nuclear Sci. 60(4), 2728–2733 (2013)
12. Mansour, W., Velazco, R., Ayoubi, R., Ziade, H., Falou, W.E.: A method and an
automated tool to perform SET fault-injection on HDL-based designs. In: 2013
25th International Conference on Microelectronics (ICM), Beirut, Lebanon, pp.
1–4, December 2013
13. Mansour, W., Velazco, R.: SEU fault-injection in VHDL-based processors: a case
study. J. Electron. Test. 29(1), 87–94 (2013)
14. Mohammadi, A., Ebrahimi, M., Ejlali, A., Miremadi, S.G.: SCFIT: a FPGA-based
fault injection technique for SEU fault model. In: 2012 Design, Automation Test
in Europe Conference Exhibition (DATE), Dresden, Germany, pp. 586–589, March
2012
15. Nápoles, J., Mogollón, J.M., Barrientos, J., Sanz, L., Aguirre, M.A.: FT-
UNSHADES2: a platform for early evaluation of ASIC and FPGA dependabil-
ity using partial reconfiguration. In: La Sociedad de Arquitectura y Tecnologa de
Computadores, pp. 1–5 (2012)
16. Rahbaran, B., Steininger, A., Handl, T.: Built-in fault injection in hardware - the
FIDYCO example. In: Second IEEE International Workshop on Proceedings of
Electronic Design, Test and Applications, DELTA 2004, Perth, WA, Australia, pp.
327–332, January 2004
17. Shokrolah-Shirazi, M., Miremadi, S.G.: FPGA-based fault injection into synthesiz-
able Verilog HDL models. In: Second International Conference on Secure System
Integration and Reliability Improvement, SSIRI 2008, Yokohama, Japan, pp. 143–
149, July 2008
18. Wulf, N., Cieslewski, G., Gordon-Ross, A., George, A.D.: SCIPS: an emulation
methodology for fault injection in processor caches. In: 2011 IEEE on Aerospace
Conference, Big Sky, MT, USA, pp. 1–9, March 2011
A Framework for High Level Simulation
and Optimization of Coarse-Grained
Reconfigurable Architectures
1 Introduction
Reconfigurable architectures have evolved greatly in recent years. Some
approaches use the standard fine-grained reconfigurable architectures like com-
mercial FPGAs, while others contain hardcore processors coupled with softcore
reconfigurable coprocessors (e.g., GARP [1]). Similarly, coarse-grained reconfigurable
architectures (CGRAs) have attracted a lot of attention from the research
community, and there has been extensive work in the domain of application-to-CGRA
mapping (e.g. [2,3]). CGRAs comprise predefined hard-core
Processing Elements (PEs) to provide computational power. Because the
PEs are capable of doing byte or word-level computations, CGRAs can pro-
vide higher performance (in terms of latency) for data intensive applications,
such as image, video and digital signal processing (DSP) when compared with
fine-grained architectures like FPGAs. Moreover, being coarse grained in nature,
CGRAs also incur smaller reconfiguration overheads.
However, there has been a parallel development in design automation of fine-
grained architectures such as academic FPGAs. Manual design and optimization
of reconfigurable architectures remains a daunting task and there is a need for
automated design-flows that take a set of target applications at higher level (e.g.
C or C++) and generate hardware descriptions of possible target reconfigurable
platforms that can then be synthesized by any standard synthesis tool to get the
final hardware.
On the other hand, if we look at design automation tools for CGRAs, extensive
work has been done in the area of architecture optimization, where researchers
have proposed various architectural templates suited for a set of target
applications [4,5]. The other major research direction is application-to-architecture
mapping, where researchers have tried to optimize different design constraints, such as
mapping time or resource usage, for a selected CGRA template [2,3]. To
the best of our knowledge, there exists no high-level simulation and optimization
design-flow targeted at CGRAs that starts from C and ends at a hardware
description of a custom CGRA. In this work, we address this aspect through
our proposed framework.
From an architectural point of view, based on the organization of their PEs,
CGRAs can be classified into two types: (i) linear array architectures and
(ii) 2-D mesh-based architectures. In a linear array architecture, PEs are
organized in one or several linear arrays, while in a mesh-based architecture the
PEs are arranged in a two-dimensional space, much like in any standard FPGA.
PipeRench [5] is an example of the former class, while PACT-XPP [4] is an
example of the latter category.
In this paper, we propose a generalized framework that can be used for high-
level simulation, optimization and resource (power & area) estimation of homoge-
neous mesh-based CGRAs. We used several codes from data/compute-intensive
application benchmark suite MiBench [6] and generated custom homogeneous
mesh-based CGRAs for target applications.
The rest of the paper is organized as follows: we start by presenting the
related work in Sect. 2 and describe the details of proposed approach in Sect. 3.
Section 4 details the implementation and simulation results for sample bench-
mark applications. We, then, conclude and draw future research directions in
Sect. 5.
2 Related Work
Our focus, in the proposed framework, is on homogeneous mesh-based CGRAs
since they provide more efficiency than linear arrays for DSP and multimedia
applications. As far as the frameworks for mesh-based CGRAs mapping are
concerned, Lee et al. [2] proposed an application mapping framework for 2-D
mesh-based CGRAs supporting both integer and floating point arithmetic. They
3 Proposed Approach
3.1 Basic CGRA Template
Like the renowned homogeneous mesh-based CGRA, PACT-XPP [4], the basic
Processing Element (PE) of our target CGRA architecture is an “arithmetic logic
unit” (ALU) as shown in Fig. 1(a). This 8-bit ALU is capable of performing eight
(8) distinct logic and arithmetic operations. These ALUs are surrounded by hor-
izontal and vertical routing channels forming a generic routing fabric where the
communication between PEs is ensured through programmable routing resources
and connection with I/Os and memory is maintained through programmable
I/O blocks. Figure 1(b) shows an abstract level view of an overall homogeneous
CGRA fabric.
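As a minimal illustration of how such a PE can be modelled for high-level simulation, the sketch below describes an 8-bit ALU whose 8:1 multiplexer selects one of eight operations. The concrete operation set is our assumption for this example; Fig. 1(a) only shows AND, OR, NOT and SHIFT explicitly.

#include <cstdint>

// Behavioural model of the basic PE: an 8:1 MUX selects which of eight
// logic/arithmetic operations is applied to the two 8-bit operands.
enum class AluOp : std::uint8_t { ADD, SUB, AND, OR, XOR, NOT, SHL, SHR };

std::uint8_t evaluatePE(AluOp op, std::uint8_t a, std::uint8_t b) {
    switch (op) {
        case AluOp::ADD: return static_cast<std::uint8_t>(a + b);
        case AluOp::SUB: return static_cast<std::uint8_t>(a - b);
        case AluOp::AND: return a & b;
        case AluOp::OR:  return a | b;
        case AluOp::XOR: return a ^ b;
        case AluOp::NOT: return static_cast<std::uint8_t>(~a);
        case AluOp::SHL: return static_cast<std::uint8_t>(a << (b & 7));
        case AluOp::SHR: return static_cast<std::uint8_t>(a >> (b & 7));
    }
    return 0;
}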
(a) Logic design of a basic ALU block. (b) Basic block diagram of 2-D mesh-based CGRA.
Fig. 1. Block diagram of mesh-based CGRA and logic design of an ALU block.
Table 1. Placement and routing time comparison between CGRA and FPGA
4 Experimental Results
This section presents the experimental results of generating both custom CGRAs
and FPGAs for different input applications. For CGRAs, we used our proposed
design-flow while for academic FPGAs, an open-source tool targeted at FPGA
[Design-flow figure: high-level descriptions in ANSI-C are processed by GeCoS retargetable compilation into a CDFG IR; DAG extraction and a DAG-to-net-list parser then produce the net-lists.]
In this work, we have used four benchmarks for our experimentation (as shown
in Table 1). For CGRA mapping, the flow described in Sect. 3 is used, whereas
for FPGA mapping the flow discussed in [8] is used. It can be seen that our
proposed framework gives equal or better mapping time results. Finally, if
we consider average time taken for both architectures, CGRA framework takes
1290 s for four benchmarks while FPGA framework takes 3522 s. This gives an
average mapping time gain of 63.3% for CGRAs over FPGAs.
4.2 Area and Power Results for CGRA vs. FPGA Implementations
For the sake of completeness, we also present the area and power consumption
results of CGRAs and FPGAs. These results are obtained using respective flows
of CGRAs and FPGAs, and they are summarized in Tables 2 and 3 for common
benchmarks mapped on CGRAs and FPGAs respectively.
Table 2 shows that an individual CGRA architecture was created for each
benchmark under consideration. Results of the individual benchmarks are shown
in lines 1 to 4 of the table. For each benchmark, a CGRA architecture is first
defined that best suits the logic requirements of the benchmark. The benchmark is
then placed and routed on the defined CGRA architecture using our proposed
flow. The flow used in this work optimizes the resources of the architecture and
culminates with the area and power estimations of the architecture. Area of a
CGRA architecture in this work is mainly divided into two parts: logic area and
routing area. Logic area of CGRA is calculated as the sum of logic area of all
the ALUs present in the architecture. Routing area of CGRA is calculated as
the sum of the areas of all the routing components in the CGRA architecture. When
the CGRA design-flow terminates after optimization, the number of routing
components and their areas are combined to give the overall routing
area of the architecture (ref. columns 2–4 and 6 of Table 2). Logic area and
routing area values are finally combined to give total area of architecture (ref
column 7 of Table 2). Dynamic power consumption values are given in column
8 of Table 2. Results shown in lines 1–4 of the table are individual benchmark
results that give an idea about the area and power requirements of each bench-
mark separately. However, a combined CGRA architecture was also defined that
satisfies the requirements of all the applications under consideration. The com-
bined CGRA results are shown in line 5 of Table 2. It can be seen from this table
that “IDCT” has the largest logic & routing resource requirements and a CGRA
architecture satisfying the needs of this application can satisfy the needs of all
the netlists of the set under consideration.
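A minimal sketch of this area book-keeping is given below; the structure and the assumption that all ALUs share the same logic area are ours, for illustration only.

#include <vector>

// Logic area is the sum of the areas of all ALUs in the architecture, routing
// area is the sum of the areas of all routing components (multiplexers, SRAM
// configuration cells, buffers), and the two are combined into the total area.
struct RoutingComponent { double area; };

struct CgraArea {
    double logic = 0.0;
    double routing = 0.0;
    double total() const { return logic + routing; }
};

CgraArea estimateArea(int numAlus, double aluArea,
                      const std::vector<RoutingComponent>& routingComponents) {
    CgraArea a;
    a.logic = numAlus * aluArea;
    for (const RoutingComponent& r : routingComponents)
        a.routing += r.area;
    return a;
}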
Area and dynamic power results of the common benchmarks for FPGAs are
shown in Table 3. To have a fair comparison, we generated the results for both
individual as well as combined FPGA architectures. It can be seen from Tables 2
and 3 that, for the combined architecture, CGRAs consume 63% fewer SRAMs. This is
because, due to the bus-based routing structure of CGRAs, they have
shared configuration memory cells for their routing switches, which eventually
leads to a smaller number of SRAMs required for the routing architecture.
However, due to this very nature of the CGRA routing structure, the channel
width is significantly increased, which results in a much larger requirement for
multiplexers. For the combined architecture, CGRAs require 70% more multiplexers than
FPGAs. Although a smaller number of SRAMs is required for the combined CGRA,
the area of an individual SRAM is much smaller than the area of a
2×1 MUX; hence the area gain from SRAMs is far outweighed by the additional area
caused by the large multiplexer requirement. These effects, combined together with
buffer area, result in 63% smaller routing area for combined FPGA compared
to the combined CGRA implementation. As far as logic area is concerned, due to
the coarser granularity of CGRAs, the number of ALUs required is much smaller than
the number of CLBs required for FPGA-based implementations. Hence, despite
the fact that the area of an individual ALU is larger than the area of a single CLB,
we eventually get a 27.5% reduction in logic area for CGRAs when compared to
FPGAs. However, it is important to mention here that the routing area of CGRAs
comprises 90% of the total area. Hence, the smaller logic area of the CGRA is
overshadowed by the larger routing area, which finally gives a 53.5% smaller FPGA
architecture that consumes 54.8% less dynamic power when compared to CGRAs.
Results presented in this section suggest that CGRAs are, on average, 63.3%
more efficient than FPGAs in terms of required placement and routing time. This
is because of the less complex nature of the CGRA fabric. However, contrary to [10],
the interconnect overhead of our proposed CGRA is relatively high. This is because
our proposed framework is based on a generic environment
that uses architecture-independent placement and routing algorithms. These
algorithms can be used for exploration of the logic and routing resources of CGRA
architectures. Due to the flexible nature of the underlying algorithms, the CGRAs in our
work are based on general-purpose programmable interconnects, as compared
with the fixed interconnects presented in the literature [10].
5 Conclusion
Compared to ASICs, FPGAs are slower and less power-efficient, but their edge
over ASICs is their programmability and flexibility. One reason for their slower
performance is their finer granularity. A potential solution is CGRAs, which
operate at word level. However, high-level simulation tools targeted at CGRAs
are nearly non-existent. This paper presents a complete high-level framework for
simulation, optimization and resource estimation of mesh-based CGRAs. As a
case study, we used embedded DSP and CGRA application benchmarks. The
results show that auto-generated homogeneous mesh-based CGRAs consume
54% more area when compared with auto-generated academic FPGAs while
providing around 63.3% faster mapping.
References
1. Hauser, J.R., Wawrzynek, J.: Garp: a MIPS processor with a reconfigurable
coprocessor. In: Proceedings the 5th Annual IEEE Symposium on Field-
Programmable Custom Computing Machines, pp. 12–21, April 1997
2. Lee, G., Choi, K., Dutt, N.D.: Mapping multi-domain applications onto coarse-
grained reconfigurable architectures. IEEE Trans. Comput. Aided Des. Integr. Circ.
Syst. 30(5), 637–650 (2011)
3. Peyret, T., Corre, G., Thevenin, M., Martin, K., Coussy, P.: An automated design
approach to map applications on CGRAs. In: Proceedings of the 24th Edition of
the Great Lakes Symposium on VLSI, ser. GLSVLSI 2014, pp. 229–230. ACM,
New York (2014)
4. Baumgarte, V., Ehlers, G., May, F., Nückel, A., Vorbach, M., Weinhardt, M.:
PACT XPP - a self-reconfigurable data processing architecture. J. Supercomput.
26(2), 167–184 (2003)
5. Goldstein, S.C., Schmit, H., Moe, M., Budiu, M., Cadambi, S., Taylor, R.R.,
Laufer, R.: PipeRench: a coprocessor for streaming multimedia acceleration. In:
Proceedings of the 26th IEEE ISCA, pp. 28–39 (1999)
6. Guthaus, M.R., Ringenberg, J.S., Ernst, D., Austin, T.M., Mudge, T., Brown, R.B.:
MiBench: a free, commercially representative embedded benchmark suite. In:
IEEE International Workshop on Workload Characterization (WWC-4), pp. 3–14,
December 2001
7. L’Hours, L.: Generating efficient custom FPGA soft-cores for control-dominated
applications. In: Proceedings of the 16th IEEE ASAP, pp. 127–133 (2005)
8. Pasha, M.A., Farooq, U., Siddiqui, M.B.: A design-flow for high-level synthesis and
resource estimation of reconfigurable architectures. In: 10th International Confer-
ence on Design & Technology of Integrated Systems in Nanoscale Era (DTIS),
Naples, Italy, pp. 1–6. IEEE (2015)
9. Zhao, Y.C.W.: New generation of predictive technology model for sub-45nm early
design exploration. IEEE Trans. Electron Devices 53(11), 2816–2823 (2006)
10. Zhang, C., Lenart, T., Svensson, H., Öwall, V.: Design of coarse-grained dynami-
cally reconfigurable architecture for DSP applications. In: International Conference
on Reconfigurable Computing and FPGAs, pp. 338–343 (2009)
Design Space Exploration
Parameter Sensitivity in Virtual FPGA
Architectures
1 Introduction
During the last three decades, Field Programmable Gate Arrays (FPGAs) have
evolved from less competitive and prototyping devices with as little as 64 logic
cells towards complex System on Chip (SoC) and massive parallel digital sig-
nal processing architectures. The functional density alone, however, is not the
unique selling point and there is still a considerable gap to ASICs in this regard.
Moreover, it is the flexibility and the comparably short design times along with
low NRE costs and low risks that make FPGAs so attractive. Currently, we are
witnessing a new movement towards general-purpose computing. The signs are
conspicuous considering the facts that (1) there is a trend towards heterogeneous
2 Related Works
2.1 Virtual FPGA Architectures
[Figure: internal structure of the V-FPGA configurable logic block — N basic logic elements (BLE_0 … BLE_N-1), each containing a K-input LUT with 2^K configuration registers (lutK_reg), a D flip-flop with enable (ffen) and reset (nRST), and an internal output BLE_out_intern; InMux and OutMux multiplexers, controlled by configuration registers, connect the N*K BLE inputs, the I block inputs (inputVector) and the O_CLB outputs; a configuration unit with a serial interface (SCLK, nSS, MOSI) loads all configuration registers.]
Connection boxes around the CLBs consist mainly of multiplexers and their
select signals are controlled by configuration registers. At the same time only
one routing track can be connected to the input through CBr, whereas several
tracks can be connected to the same output through CBw. PSMs realize the
global routing of the signal paths by connecting tracks from different channels
at the intersections. Therefore, a 4:1 MUX is located at each output of a PSM, as
shown in Fig. 3(a). A PSM has W inputs and W outputs on each side, where W is the
channel width. On the left and bottom sides, the first position of the MUX is the
logic level ‘1’, which is the defined idle value of the routing infrastructure, i.e. if
there is no routing intended in this direction. The three remaining positions are
each associated with an input from one of the three adjacent sides. The two select
lines of the MUX are controlled by configuration registers set by the configuration
unit during programming. On the top and right sides of the PSM, the inputs
can be fed back to the outputs of the same sides by selecting the first position of
the respective multiplexers. This technique, which we call loopback propagation,
enables emulation of bi-directional tracks using uni-directional tracks.
IOBs on the perimeter of the array have exactly one input and one output and work
in a similar way to the connection boxes of the CLBs. As shown in Fig. 3(b),
a MUX connects one of the tracks from the routing channel to the output pad.
When an output is not assigned, logic ‘0’ is issued by an AND gate connected
between the MUX and the configuration register bit ren. In favour of higher
routability, the input pad can be connected to several tracks in parallel through
respective 2:1 MUXs. All the MUXs are controlled by configuration registers.
$T_{MUX4} = 2 \cdot T_{MUX2} + T_{net}$ (9)
$T_{LUT} = T_{net} + K \cdot (T_{MUX2} + T_{net})$ (10)
$T_{BLE\_outMUX} = \log_2(O) \cdot (T_{MUX2} + T_{net})$ (11)
$T_{BLE\_inMUX} = \log_2\left(N + \frac{I}{K}\right) \cdot (T_{MUX2} + T_{net})$ (12)
$T_{IOB\_in} = T_{MUX2} + T_{net}$ (13)
$T_{IOB\_out} = (\log_2(W) - 1) \cdot (T_{MUX2} + T_{net}) + T_{AND2} + T_{net}$ (14)
These models target a fine grained underlying platform (e.g. the 3-input Versa-
Tiles in Actel ProASIC3) and need to be slightly modified when the underlying
platform changes. For instance, for an underlying platform with 6-input LUTs,
a 4:1 MUX becomes an MSBE as it will have the same area and timing as a
2:1 MUX (both can be realized by 1 LUT). Note that the additive MSBE based
models are pessimistic as they don’t reflect possible LUT sharing techniques.
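For illustration, equations (9)–(14) can be evaluated directly once the MSBE characterisation of the underlying platform is known. The C++ sketch below does this for a given parameter set (K, N, I, O, W); it assumes power-of-two multiplexer sizes so that the logarithms are exact, and the structure and names are ours.

#include <cmath>

// Delay models of Eqs. (9)-(14); tMux2, tNet and tAnd2 are the characterised
// delays of a 2:1 MUX, a net and a 2-input AND on the underlying platform.
struct Delays { double mux4, lut, bleOutMux, bleInMux, iobIn, iobOut; };

Delays delayModels(double tMux2, double tNet, double tAnd2,
                   int K, int N, int I, int O, int W) {
    Delays d{};
    d.mux4      = 2.0 * tMux2 + tNet;                                   // (9)
    d.lut       = tNet + K * (tMux2 + tNet);                            // (10)
    d.bleOutMux = std::log2(static_cast<double>(O)) * (tMux2 + tNet);   // (11)
    d.bleInMux  = std::log2(N + static_cast<double>(I) / K)
                  * (tMux2 + tNet);                                     // (12)
    d.iobIn     = tMux2 + tNet;                                         // (13)
    d.iobOut    = (std::log2(static_cast<double>(W)) - 1.0) * (tMux2 + tNet)
                  + tAnd2 + tNet;                                       // (14)
    return d;
}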
follow. The steps of packing, placement and routing require information about
the target virtual FPGA architecture and the parameters and constants related
to area and delay models, which are provided through architecture files. Some
of the area and delay model equations in Sect. 4 are dependent on W , which is
known only after the routing process. Thus initially the estimated W is used.
For an improved accuracy, a feedback is needed to update the architecture file
with the actual channel width W and to re-run the area- and/or timing-driven
P&R processes. The results, in terms of array size, channel width, area and critical
path delay, are stored in a database for assessing the figure of merit (FOM).
Then the process is repeated with other combinations of N and K in a nested
loop to span the design space of interest.
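The exploration loop described above can be sketched as follows; placeAndRoute() stands in for the technology mapping, packing, placement and routing steps of the real flow and is only a dummy here, with made-up numbers used purely so the sketch compiles and terminates.

#include <vector>

// One explored design point: cluster size N, LUT size K, routed channel
// width W, and the figures used later for the FOM assessment.
struct Result { int N, K, W; double area, criticalPathDelay; };

// Dummy stand-in for the tool steps; the real flow derives W, area and the
// critical path delay from the benchmark set and the architecture file.
Result placeAndRoute(int N, int K, int wEstimate) {
    Result r{N, K, 10, 1000.0 / K + 50.0 * N, 5.0 + 0.1 * K * N};
    (void)wEstimate;
    return r;
}

std::vector<Result> exploreDesignSpace(int nMax, int kMax) {
    std::vector<Result> database;
    for (int K = 2; K <= kMax; ++K) {
        for (int N = 1; N <= nMax; ++N) {
            int wEst = 8;                     // initial channel-width estimate
            Result r = placeAndRoute(N, K, wEst);
            // Feedback: update the architecture description (here just wEst)
            // with the actual channel width and re-run P&R until they agree.
            while (r.W != wEst) {
                wEst = r.W;
                r = placeAndRoute(N, K, wEst);
            }
            database.push_back(r);            // assessed afterwards via the FOM
        }
    }
    return database;
}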
[Flow-chart figure of the design space exploration: starting from K=2 and N=1, the benchmarks are technology-mapped, packed, placed and routed using an estimated channel width W_est from the architecture file; if the resulting minimum channel width W_min differs from the estimate, the architecture file is updated and P&R is repeated; the results enter a database for FOM assessment, and the loop continues until N=N_max and K=K_max.]
Fig. 5. Effects of LUT size K on (a) area (b) performance (c) area-delay product.
of cluster sizes with N = 1..10 and LUT inputs with K = 2..8. Figures 6, 7
present the resulting variances in area and performance. Interestingly, quite a
few area curves have a sawtooth characteristic with minima at N = 1 for all K
indicating that clustering is harmful for the respective benchmarks if area effi-
ciency is the objective. For the average case starting with N = 4 and K = 2, N
should decrease with increasing K for better area efficiency. On the whole, the
performance increases with rising K and N . The evaluation also shows a strong
parameter sensitivity with variances up to ±95.9% in area and ±78.1% in perfor-
mance. Furthermore, the fluctuating benchmark curves confirm that application-specific
customization can yield substantially better results than relying on average
values for the parameterization of the architecture.
7 Conclusions
In this paper, an extended version of the V-FPGA has been introduced and
area and delay models suitable for virtualization have been derived by decomposing
the architecture into MSBEs. In contrast to the existing models, which are
based on the transistor level, the new models adopt the characterization of MSBEs
that are mapped onto the desired underlying COTS FPGA. Thus they provide
a more realistic view for the new design space exploration methodology and also
to the CAD tools for application mapping. The analysis of over 1400 benchmark-
runs with various combinations of LUT size and cluster size reveals a high para-
meter sensitivity with individual variances up to ±95.9 % in area and ±78.1 %
in performance. This proves a remarkable potential for application specific opti-
mizations through parameter tuning. For general purpose cases, an averaging of
area-delay products over the examined benchmarks leads to recommendations
of K = 5..7 for unclustered logic CLBs and combinations of K = 4..7 with
N = 2..5 for clustered CLBs. However if the target application field is narrow,
it is not recommended to rely on averaging as the individual benchmarks differ
tremendously from the average values. Furthermore, our results show some dis-
crepancy in the parameter recommendations of physical FPGAs and discourage
a 1:1 adoption to virtual FPGAs.
Custom Framework for Run-Time Trading
Strategies
1 Introduction
In finance, a trading strategy is a fixed plan that is designed to achieve a prof-
itable return by buying or selling stock on certain markets. Numerous trading
strategies are employed in financial markets with many outcomes in mind - the
most common being the identification of market trends. Understanding the dif-
ferent market characteristics is a first step towards being able to identify and
measure them. This, in turn should link trend-following performance to the state
of these market characteristics. Finally, this might be a step towards devising a
way for a trend-following strategy to adapt to these changing market regimes.
Nowadays, with the advance of hardware acceleration devices such as field
programmable gate arrays (FPGAs), it is possible to attain high component den-
sity and low power consumption, while achieving minimal latency [1]. Most of
the existing solutions allow reconfiguration between different computations, but
do not take advantage of their Partial Reconfigurability (PR): the possibility to
reconfigure the device during the same computation. When using PR the appli-
cation is represented as a sequence of operations that do not need to (or cannot)
$EMA_x = \alpha \cdot s_x + (1 - \alpha) \cdot EMA_{x-1}$ (1)

where $EMA_x$ is taken for each value $(x + n) < i < (x + k - n)$, $s$ is the data
set and $\alpha$ stands for the weight $\alpha = \frac{2}{movingAverageLength + 1}$. In finance, it is
often used to detect trends in price, in particular by comparing two simple
moving averages: one over a long window and one over a short window [4]
(a plain C++ sketch of this and the following two indicators is given after this list).
Previous work has shown how to create a fully-pipelined simple 3-point mov-
ing average kernel [5]. However, this method has not been evaluated with
market data [6] and it only provides small window moving averages, com-
pared to what real-world applications need (e.g. 200-point moving average
to predict financial trends [4]).
(2) Price Rate of Change: The price rate of change (PROC) is a trading strategy
that measures the percentage change between the most recent price and the
price “n” periods in the past, using the following formula:
$\frac{p_{now} - p_n}{p_{today-n}} \times 100$ (2)

where $p_{now}$ stands for the value of the closing price now and $p_n$ represents
the closing price value “n” periods ago. It is used by traders to confirm
price movements, detect divergences, and determine potential over-
bought/oversold areas.
(3) Bollinger Bands: A Bollinger Band (BB) is a band plotted two standard
deviations away from a simple moving average. Because standard deviation
is a measure of volatility, BBs adjust themselves to the market conditions,
as follows:
$SMA_X \pm (STD_X \times 2)$ (3)

where $SMA_X$ stands for the simple moving average of the past X closing prices
(we use EMA instead of SMA), and $STD_X$ represents the standard deviation
of the past X closing prices. The closer the prices move to the upper band,
the more overbought the market, and the reverse is true for oversold market
identification.
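The three indicators can be summarised in plain C++ as below; this is our own illustration of Formulas (1)–(3), not the FPGA kernel code, and it reads the denominator of Formula (2) as the closing price n periods ago.

#include <cmath>
#include <vector>

// Exponential moving average, Formula (1): alpha = 2 / (movingAverageLength + 1).
double nextEma(double price, double previousEma, int movingAverageLength) {
    const double alpha = 2.0 / (movingAverageLength + 1);
    return alpha * price + (1.0 - alpha) * previousEma;
}

// Price rate of change, Formula (2): percentage change between the most recent
// price and the closing price n periods ago.
double priceRateOfChange(double priceNow, double priceNPeriodsAgo) {
    return (priceNow - priceNPeriodsAgo) / priceNPeriodsAgo * 100.0;
}

// Bollinger bands, Formula (3): the moving average plus/minus two standard
// deviations of the last X closing prices (we use the EMA as the middle band).
struct Bands { double lower, middle, upper; };

Bands bollinger(const std::vector<double>& lastXPrices, double movingAverage) {
    double variance = 0.0;
    for (double p : lastXPrices)
        variance += (p - movingAverage) * (p - movingAverage);
    const double stdev = std::sqrt(variance / lastXPrices.size());
    return {movingAverage - 2.0 * stdev, movingAverage, movingAverage + 2.0 * stdev};
}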
In our study, we employ a BUY trading strategy. Taking the example of a
GBP/USD FX trading market, whenever we choose to BUY 50 units of GBP, we
will take the reverse position [3]. We compute our returns (which represent the
profit in $M ) on the CPU, after the FPGA provides the final trading position
decision, using the following formula:
$R_c = (-d_t \cdot p^A_t + d_t \cdot p^B_t) - tc_t$ (4)

where $d_t$ takes the value 50 in our case, $p^A_t$ and $p^B_t$ represent our two currency
prices, GBP and USD, at time $t$, while $tc_t$ stands for transaction costs. Transaction
costs are neglected for our simulated synthetic market data; however, they
are included within our historical tick data.
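A one-line sketch of the return computation of Formula (4), as performed on the CPU side, is shown below; the function and parameter names are ours.

// Return per Formula (4): the reverse position of d_t = 50 units on currency A
// against currency B at time t, minus the transaction cost (zero for the
// synthetic data, non-zero for the historical tick data).
double tradeReturn(double priceA, double priceB, double transactionCost,
                   double units = 50.0) {
    return (-units * priceA + units * priceB) - transactionCost;
}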
Fig. 1. SRTS: (a) choose TS group A to run. (b) choose TS group B to run
$O_f = r_t + d_t$ (5)
Fig. 2. FRTS: (a) kernel running, (b) data reload, (c) kernels reload
$O_p = r_t + t_e$ (7)
where rt stands for reconfiguration time and te stands for increased execution
time due to reduced clock frequency. In our case, when we take the trading
decision and perform the returns using Formula 4 on CPU, we take into consid-
eration our two different filters as follows: if one of our main trading strategies
(i.e. TS1 - TMAC) suggests a BUY position should be taken, we check if one
of our filters (i.e. TS3 - PROC) suggests the same; if not, we decide not to
pursue the position. It is important to note that this is just one particular
approach in which this framework could be used.

Fig. 3. (i) PRTS: (a) kernel running, (b) kernel switch; (ii) PR configuration region
4 Implementation Details
The implementation of the proposed designs depends heavily on the proper-
ties of the target system. The accelerator system we use is a Maxeler MPCX
node. If we were to use a different architecture than the one based on Maxeler’s
infrastructure, then the implementation and its respective performance would
change according to the new accelerator’s specifications: e.g. the reconfiguration
time could change using a different reconfiguration methodology on different
boards, or the communication channel may change among different architec-
tures. The system properties are summarised in Table 1 and it consists of a CPU
node in 32 nm transistor technology, and a DFE node with the FPGA in 40 nm
transistor technology. The two are connected via Infiniband through a Mellanox
FDR Infiniband switch. The implementation of the architectures closely follows
their design as presented in Figs. 1, 2 and 3.
The Virtex-6 SX475T FPGA used in this work has 16 clock regions: we do
not place PR regions in the central clock regions of the chip as this could reduce
the impact on the routing process for memory controllers.
(1) Exponential Moving Averages on FPGA. The FPGA design of our EMA
consists of a series of statements defining input and output streams and
computations on streams. As we store all data elements in memory, we have
a register which stores the sum; at each tick it shifts in new data and combines
it with the present sum following Formula (1). We use the exponential moving
average in the following two strategies [4] (a sketch of the resulting decision
logic is given after this list):
(a) Double Moving Averages Crossover Trading Strategy (DMAC). It involves
two MAs: one short and one long. We pick the case most frequently encountered
in practice for short-term market fluctuations, computing the short and
long MAs over 25-point and 200-point trading windows of closing prices,
respectively. The strategy trades when the short MA crosses
the long MA from above and below. In our system, DMAC will exit and
switch to the triple moving average crossover trading strategy when the
moving averages cross.
(b) Triple Moving Averages Crossover Trading Strategy (TMAC). It uses
three MAs: one short, one medium, and one long. The most common
MA lengths proven to give good results in practice are: 25-point trading
window for the short MA, 100-point for the medium MA and 200-point
for the long MA. The strategy takes a buy decision if: (i) Short MA is
above the medium MA; (ii) Medium MA is above the long MA where the
short MA is already over the medium MA.
(2) Price Rate of Change Trading Strategy (PROC). The most common period
for PROC is 12-periods for short-term signals. We decide to use this value
for testing, as it is aligned with our high-frequency trading strategy trend-
following approach. Generally, a negative PROC value shows that the mar-
ket is being oversold, while a positive PROC value observes the market as
being overbought. In our case, when the PROC value ≤ −30% we decide to
take a Buy position [4].
(3) Bollinger Bands Trading Strategy (BB). We use a 20-period EMA as the
“middle band” (one of the most commonly used values for short-term trend
identification in the financial markets), so our “lower band” and “upper band”
BB values are based on a 20-period price standard deviation as well. This
trading strategy acts as a filter on top of the other trading strategies, alongside
the PROC trend-following strategy, for further trend direction strength
optimisation [4].
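The decision logic combining these strategies can be sketched as follows. It is our own C++ illustration of the rules quoted above (TMAC ordering of the 25/100/200-point averages and the −30% PROC filter), not the kernel code generated for the FPGA.

// TMAC generates a BUY candidate when the 25-point MA is above the 100-point
// MA and the 100-point MA is above the 200-point MA; the PROC filter
// (12-period, Formula (2)) must indicate an oversold market (<= -30%) before
// the position is actually taken.
struct MovingAverages { double shortMa, mediumMa, longMa; };  // 25 / 100 / 200 points

bool tmacSuggestsBuy(const MovingAverages& ma) {
    return ma.shortMa > ma.mediumMa && ma.mediumMa > ma.longMa;
}

bool procFilterAgrees(double procValue) {
    return procValue <= -30.0;   // oversold market
}

bool takeBuyPosition(const MovingAverages& ma, double procValue) {
    return tmacSuggestsBuy(ma) && procFilterAgrees(procValue);
}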
6 Evaluation
Our PRTS implementation runs at a clock frequency of 150 MHz, while our
FRTS and SRTS implementations run at a clock frequency of 175 MHz. For the
PRTS approach, an increase in clock frequency from 150 MHz to 175 MHz is
difficult to obtain as it was hard to meet timing requirements. All the run times are
measured using std::chrono::high_resolution_clock, which is part of the
C++11 standard library. Both CPU and FPGA times measured include the time
to process the total market ticks (respectively the total market ticks between
switches). We perform different experiments on both synthetically generated
FX GBP/USD market data at different trading frequencies, as well as histori-
cal data: First, we analyse the speedup and returns for SRTS, PRTS and FRTS
designs in an offline environment. Then, we identify the best trading opportunity
when checking the obtained performance for multiple data set dimensions, and
different trading-window switch frequencies. Last, we simulate a real-time trad-
ing environment, thus accounting for the data loss encountered during different
trading-window switch durations when using FRTS and PRTS.
(possibly with different parameters - in practice, traders choose to run the same
trading strategy with different parameters in parallel, so that they can identify
its optimum coefficients at any point in time). Table 3 shows the resource usage
for the generalised framework version as described previously.
Table 2. FPGA total resource usage for static kernels. Measurements are provided for
864M data points and a 175 MHz clock frequency
Table 3. Resource usage for PRTS and FRTS provided for 864M data points
Table 5. PRTS - FPGA performance results for 864M market data entries
Table 6. FRTS - FPGA performance results for 864M market data entries
streaming algorithm. We notice that when we lose access to the data correspond-
ing to the switch time period, we seem to be decreasing our overall returns, as
well as encounter losses at times. However, the PRTS returns are higher than the
FRTS ones which shows that simply switching between trading strategies at dif-
ferent times is not good enough, but by introducing momentum filters (as in the
PRTS approach) we better account for the financial markets condition changes
and avoid under-performance of one particular selected trading strategy.
Table 7. PRTS vs FRTS return results for 864M market data entries
Figure 4(i) shows the different returns for both PRTS and FRTS solutions,
when we account for the data loss that would appear during the switch time.
This graph presents a 30-minute trading strategy switch frequency over different
market data entries (i.e. 28.8M). The “real-time” simulation of our trading
strategy shows that when losing access to the data corresponding to the switch
time period, returns decrease as data become less reliable and more volatile.
Figure 4(ii) shows the different switch times corresponding to each of the
respective number of market entries, when running both PRTS as well as FRTS,
using all the implemented trading strategies from the strategy kernel library. We
notice that the switch time stays roughly constant for the PR solution regardless
of the number of market entries, while it increases with the number of market
data points in the case of the FR implementation.
costs. Table 8 shows returns for all static, partial-reconfiguration as well as full-
reconfiguration approach when trying to simulate a continuously streaming algo-
rithm using historical market data. We can notice a slight decrease in the return
levels from 2003 to 2004, which is very much in accordance with the greater FX
market efficiency in 2004 compared to 2003 (i.e. a growth in electronic high-
frequency trading occurred during the 2003–2008 period).
Our study aims to provide the first framework for developing and comparing
multiple trading strategies for FPGA designs. Our tool offers the user multi-
ple solutions for running their trading strategies. Three architecture types are
supported: static, partial-reconfiguration and full-reconfiguration. Our approach
offers alternative solutions when a static design becomes too large because of too
many different trading strategies or the trading strategies themselves are very
complex and occupy a significant amount of resources [8]. If the resources of a
given device run out, a larger FPGA would be needed, but if not available, our
framework offers the user a low-cost, resource and performance efficient solution.
We show that FPGAs can effectively accelerate a system based on multiple
trend-following trading strategies which come as an initial library for our frame-
work. Our SRTS design achieves 11 times speedup, the PRTS design achieves 7
times speedup, while the FRTS design achieves up to 2 times speedup, when com-
pared to the corresponding multi-threaded C++11 implementation running on a
six-core Intel Xeon CPU X5650 processor. After testing our tool on historical FX
data, we show that trading strategies supported by the proposed design are reli-
able and, if further exploited, can increase profitability from high frequency FX
markets trading. Thus, applying different trading strategies based on different
market regimes would help the modeling process better reflect the reality.
Opportunities for further work include adding support for varying data repre-
sentation and evaluating speedup/returns improvements on more recent financial
market data. We could include multiple copies of trading strategies on-chip so
that one could start processing without waiting for a previous computation to
finish. We also aim to enhance the trading strategies kernel library, implementing
additional effective strategies on the FPGA, as well as developing macroeconomic
and news factors for stock and fixed income trading. We further plan to include
designs to optimally detect regime change, such as those based on permutation
entropy [9]. In the future, we will also make our framework available as open
source, to allow developers to add their custom strategies to the library.
References
1. Wray, S., et al.: Exploring algorithmic trading in reconfigurable hardware. In: ASAP
(2010)
2. Altera. FPGA Run-Time Reconfiguration: Two Approaches - White Paper.
ftp://ftp.bittware.com/documents/fpga-run-time-reconfiguration.pdf
3. Driver, M.: Foreign Exchange: A Practical Guide to the FX Markets. CreateSpace,
North Charleston (2012)
4. Aldridge, I.: High-Frequency Trading: A Practical Guide to Algorithmic Strategies
and Trading Systems (Wiley Trading), 2nd edn. Wiley, Hoboken (2013)
5. Maxeler Technologies, MaxCompiler-WhitePaper (2001). https://www.maxeler.
com/media/documents/MaxelerWhitePaperMaxCompiler.pdf
6. Leber, C., et al.: High frequency trading acceleration using FPGAs. In: FPL (2011)
7. Mastinu, M.: Design flow to support dynamic partial reconfiguration on Maxeler
architectures. Politecnico di Milano (2012)
8. Funie, A.I., et al.: Reconfigurable acceleration of fitness evaluation in trading strate-
gies. In: ASAP (2015)
9. Guo, C., et al.: Pipelined reconfigurable accelerator for ordinal pattern encoding.
In: ASAP (2014)
Exploring HLS Optimizations for Efficient
Stereo Matching Hardware Implementation
1 Introduction
etc [1]. Today several existing HLS tools have shown their efficiency for producing
acceptable design performances and shortening time-to-market [6,8].
For a given design, defining the priority of constraints could vary from one
application to another. For example, power consumption is a key factor for
battery-based systems while hardware resources matter if several functionalities
would be embedded on the same chip. In some other cases, timing is crucial for
safety critical applications while Quality-of-Service is important for interactive
or multimedia applications. During the design phase, it is the role of the designer
to define the priorities of system constraints then to explore the design space
for the implementation that could efficiently satisfy them. In this research work,
the design space was built by applying a set of high level synthesis optimiza-
tion steps. The obtained designs have different trade-offs in terms of hardware
resources (FF, LUT or BRAM), power consumption, timing and operating fre-
quency. Our objective is to explore the possible hardware designs and then choose
the one that best fits our requirements. As a case study, we focus on the
development of an FPGA-based system dedicated to streaming stereo matching
applications. Our application considers Multi-Window Sum of Absolute Differ-
ence (Multi-Window SAD) algorithm [4] performed on input gray images of size
640 × 480 with maximum disparity = 64.
As a similar work targeting stereo matching domain, authors in [9] examined
five stereo matching algorithms for their HLS implementation. Five optimization
steps were applied to the SW code: baseline implementation, code restructuring,
bit-width reduction, pipelining and parallelization via resource duplication. Our
work differs from that presented in [9] as follows: (i) Baseline implementation
is considered as step zero in our work because our input code is HLS-friendly.
(ii) Dividing an image into strips can be achieved in three different ways with
vast difference in terms of execution time and resource utilization (Optimization
#1). (iii) Parallelism was exploited in both work at different levels. In our work,
data-independent loops are executed in parallel by duplicating the input data
stream (Optimization #3). We also increased the number of processed dispar-
ity lines at the same time either by enlarging the size of strip (Optimization
#7) or by duplicating the top-level function (Optimization #8). While authors
in [9] applied parallelism only by duplicating the disparity computation pipeline.
Authors in [7] purposed an optimized C-code for Sobel filter in three steps.
Although the design run on Zynq platform; no details were mentioned on how
the HLS-based Sobel filter was interfaced and connected to the system. In this
work, we will detail this point in Sect. 4. In addition to that two more optimiza-
tion steps related to Zynq platforms are presented in Sect. 3 (Optimization #5
and #6).
The rest of this paper is organized as follows: Sect. 2 describes our case study
related to Multi-Window SAD stereo matching algorithm. Section 3 represents
our main contribution that explores high level optimization steps for an efficient
implementation for our case study. System architecture and experimental results
are presented in Sect. 4.
Several methods in the literature were proposed to find the best match-
ing [10]. In Multi-Window SAD [4], the absolute difference between pixels from
the right and left images are aggregated within a window. The window of min-
imum aggregation is considered as the best matching among its candidates. In
order to overcome the error that appears at the regions of depth discontinuity,
the correlation window can be divided into smaller windows and only non-errored
parts are considered in calculations. Figure 1b shows 5-window SAD configura-
tion: pixel (P) lies in the middle of window (E) while it is surrounded by another
four windows named (A, B, C and D). The four windows are partially overlapped
at the border pixel (P). The score of any window is equal to the aggregation of its
pixels. In 5-window SAD, the correlation score at pixel (P) is equal to the score
value of window (E) in addition to the best minimum two score values of the
other four windows (A, B, C and D). The minimum score among the candidates
is considered as the best matching.

Fig. 1. (a) Calculating the depth of an object in stereo matching problem (b) 5-window
SAD configuration

Occlusions are common in the stereo matching problem, where objects are sometimes
captured by only one camera. For example, pixel (M) in Fig. 1a was captured only
by the right camera. Therefore, a Left/Right consistency check is done in order to remove
occluded objects from the final disparity image.
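The core of the correlation step can be expressed compactly. The sketch below is our own C++ illustration (not the HLS implementation) of the 5-window score, assuming the per-window SAD scores have already been aggregated.

#include <algorithm>
#include <array>
#include <vector>

// 5-window SAD score at a pixel: the score of the central window E plus the
// two smallest scores among the surrounding windows A, B, C and D.
double fiveWindowScore(double scoreE, std::array<double, 4> surrounding) {
    std::sort(surrounding.begin(), surrounding.end());
    return scoreE + surrounding[0] + surrounding[1];
}

// The disparity whose combined score is minimal is taken as the best match
// (one entry per disparity candidate, up to the maximum disparity of 64).
int bestDisparity(const std::vector<double>& combinedScores) {
    return static_cast<int>(std::min_element(combinedScores.begin(),
                                             combinedScores.end())
                            - combinedScores.begin());
}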
Table 1. Synthesis results reported by Vivado HLS for each optimization step

Design | Slice | FF | LUT | BRAM 18K | SRL | Freq. (MHz) | Exec. time (ms) | % change in Perf.
SW version | 380 ms on a Core i7 @ 2.7 GHz with 16 GB of RAM
#1  | X    | 2637  | 5918  | 7392 | 0   | 100 | X     | X
#2  | 898  | 1743  | 2735  | 155  | 0   | 100 | 30080 | 0
#3  | 859  | 1758  | 2659  | 113  | 0   | 100 | 22410 | 25.4
#4  | 1400 | 2552  | 3738  | 75   | 0   | 100 | 8163  | 72.8
#5  | 983  | 1525  | 2567  | 47   | 0   | 100 | 5786  | 29.1
#6  | 985  | 1713  | 2768  | 65   | 0   | 100 | 2679  | 53.7
#7  | 2695 | 6088  | 7611  | 57   | 0   | 100 | 328   | 87.7
#8  | 2688 | 6134  | 7661  | 59   | 0   | 100 | 331   | −9.1
#9  | 2822 | 6365  | 8116  | 59   | 0   | 100 | 307   | 7.2
#10 | 7989 | 20256 | 24433 | 112  | 0   | 100 | 76    | 75.2
#11 | 7995 | 18765 | 24945 | 112  | 39  | 150 | 51    | 32.8
#12 | 8038 | 21250 | 26483 | 112  | 121 | 200 | 38    | 25.5
4 Experimental Results
The generated HLS SAD IP was tested experimentally to validate both its
proper functioning and the estimated results. During our experiments, we used
Vivado 2015.2 design suite to implement our system over Zynq ZC706 FPGA
evaluation board (XC7Z045-FFG900) with input grey images of size 640 × 480.
The system was configured for 5-window SAD with the following parameters:
winH = 23, winV = 7, cwinH = 7, cwinV = 3 and maximum disparity = 64.
Figure 2 illustrates the connection of HLS SAD core to the other cores in
the system. Pixels were transferred between the processing system (PS ) and
HLS SAD block through two AXI DMA cores. AXI VDMA and HDMI cores
were used to display the obtained disparity image on the output screen.
We obtained different design choices by exploring the effect of optimiza-
tion #8 at different operating frequencies of 100, 150 or 200 MHz as listed in
Table 2. During the experiments, we increased the level of parallelism up to 8
instances operating at the same time. We stopped at that level due to the limited
LUT resources (design #23 consumed 95.37% of the LUTs). The default synthesis and
implementation strategies were used for all designs. For design #18,
Flow Perf Optimized High and Performance Explore were used as synthesis and
Fig. 3. Radar chart over the axes BRAM18K, FF, LUT, execution time, power and frequency for designs #15, #16, #17, #18, #23 and the system constraints. (Color figure online)
than others even if they consumed more power. For example, design #23 had the
lowest energy consumption of 21.51 mJ although it recorded one of the highest
power consumptions (2.12 W).
All design variations listed in Table 2 could be accepted as a solution but
the applied system constraints will direct our final decision to choose one design
among the others. Figure 3 depicts some of the candidate designs (#15, #16,
#17, #18 and #23) along with the system constraints to guide the designer
towards the efficient solution. The orange shaded area represents the system
constraints defined by the designer which are: power consumption ≤ 2 W, exe-
cution time ≤ 15 ms, LUT ≤ 180000, FF ≤ 140000, BRAM ≤ 700 and frequency
≤ 150 MHz. From Fig. 3, we can deduce that design #17 succeeded in satisfying
all the system constraints. Design #15 had relatively lower hardware utilization
and an acceptable execution time compared with design #17; however, it failed to
meet two design constraints (power consumption and frequency).
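As an illustration of how such a selection could be automated, the sketch below encodes the constraint set quoted above as a simple predicate; the struct layout and field names are hypothetical and do not correspond to any tool output.

#include <stdbool.h>

/* Hypothetical record for one design point taken from Table 2. */
typedef struct {
    int    id;
    double power_w;      /* on-chip power (W)   */
    double exec_ms;      /* execution time (ms) */
    int    lut, ff, bram;
    double freq_mhz;
} design_t;

/* System constraints quoted in the text (Fig. 3). */
static bool meets_constraints(const design_t *d)
{
    return d->power_w  <= 2.0    && d->exec_ms  <= 15.0   &&
           d->lut      <= 180000 && d->ff       <= 140000 &&
           d->bram     <= 700    && d->freq_mhz <= 150.0;
}

Among the feasible designs returned by such a check, the one with the lowest energy (power multiplied by execution time) would then be selected.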
5 Conclusion
Using HLS tools for complex system design is becoming essential to increase design productivity and shorten the time-to-market. As future work, we will automatically explore designs at higher levels of parallelism. In addition, we will build a model to predict whether a given design is feasible for a given set of system constraints.
Architecture Reconfiguration as a Mechanism
for Sustainable Performance of Embedded
Systems in case of Variations in Available Power
1 Introduction
The main objective of the presented research is to find an effective mechanism that allows autonomous embedded computing systems, such as those deployed on satellites, robotic systems, and stand-alone terrestrial and marine monitoring systems, to sustain the performance of their multi-task workloads in the presence of significant variations in available power. The concept of the proposed approach is to adapt to a reduced energy budget by using architecture reconfiguration to appropriately reduce system power consumption. In [1], it was
shown that, for a given task algorithm, a number of architecture variants can
be obtained, which exhibit different performance, operating frequency, resource
usage and power consumption. These architecture variants can be referred to
as Application Specific Processing circuits (ASP circuits). Dynamically recon-
figuring a suitable ASP circuit for each active task can provide the system with
much greater control over its power consumption. Required performance of high
priority tasks can be sustained by reconfiguring an ASP circuit which utilizes
more hardware resources at a reduced operating frequency. For tasks with lower
priority, variants which use fewer resources can be reconfigured so that they can
still continue their functionality at a reduced performance (e.g. lower video reso-
lution or communication bandwidth). As a result, current power constraints are
satisfied while maintaining the required performance for most important tasks.
Thus, when there is a change in the power budget, the system selects a
suitable combination of ASP circuit variants for its active tasks such that the
total power consumption is reduced. In order to do so, the system must be able to estimate, at run-time, the power consumption of the combination of ASP circuit variants under consideration. For a large system with, say, 10 active tasks and 10 ASP circuit variants per task, the possible number of combinations of ASP circuits that can be deployed on the FPGA is 10^10. Measuring FPGA power consumption, or estimating it using vendor tools, for such a large number of combinations and maintaining a look-up table for the purpose of adaptation is practically infeasible. Hence, a run-time power consumption estimation (PCE) model needs to be developed that is general enough to estimate the power consumption of the FPGA for any combination of ASP circuits while maintaining a certain degree of accuracy.
Additionally, the proposed approach also requires the development of a spe-
cial on-chip infrastructure to be deployed in partially reconfigurable FPGA
devices, which provides flexible reconfiguration and re-location of ASP circuits.
Multi-mode Adaptive Collaborative Reconfigurable self-Organized System, i.e.
MACROS framework has been developed for this purpose. The framework is
briefly described in this paper and is detailed in [2,3].
This paper has the following contributions: (a) It presents a generic proce-
dure to derive a PCE model for any FPGA; (b) It discusses the mechanism of
architecture reconfiguration of a suitable ASP circuit variant for each system
task, based on system power consumption estimated by the model at run-time.
This enables the system to adapt to varying power constraints at run-time.
Section 2 analyzes existing works in the field of controlling and modeling power consumption for FPGA-based systems. Section 3 briefly discusses the MACROS framework introduced above. Section 4 presents in detail the method to derive a PCE model for an FPGA, using the Xilinx Zynq XC7Z020 device as an example. Section 5 uses the derived model to demonstrate how a system sustains its critical task's performance for a longer time under a depleting power budget. Section 6 concludes the paper and discusses future work.
2 Related Works
Commonly adopted methods for reduction and control of power consumption in
embedded systems are power gating [4,5] and Dynamic Voltage and Frequency
Scaling (DVFS) [6,7]. Neither method allows architecture variability, and hence they cannot take advantage of the potential trade-offs between parallelism and
operating frequency. Use of Dynamic Partial Reconfiguration (DPR) coupled
with frequency control can offer more power consumption flexibility.
DPR is used in [8] to control the number and type of computing circuits
present in a power-aware system for efficient dynamic resource management.
Architecture Reconfiguration as a Mechanism for Sustainable Performance 179
However, here too, since task architectures are fixed, the system can only man-
age scheduling and allocation of the computing circuits to control power con-
sumption. The authors of [9] use the concept of design variants which have differ-
ent resource utilization and operating frequency. Although they analyze power
consumption with respect to parallelism and frequency scaling, a fixed choice of
an optimum variant is made by the designer at design time. Our approach, on the contrary, uses the concept of task architecture variants from [1] to adapt to
dynamically changing power conditions at run-time.
A system using our proposed approach requires a model, which can estimate
system power consumption for any combination of task variants at run-time.
Most high-level modeling methods are aimed at specific entities such as IP cores [10], arithmetic operators [11], or soft processors [12], and are
not generic for the FPGA as a whole. The run-time model presented in [8] is
also based on intimate knowledge of the task architecture, and cannot be easily
expanded to large numbers of architectures. The development of a high-level, simple, and accurate model for the FPGA is thus essential.
3 MACROS Framework
111.375, 148.5 and 185.625 MHz. A maximum frequency of 185.625 MHz is used,
as beyond this multiple, the timing requirements of the design start to fail for
some test cases. Thus, a total of 25 test cases are generated for this step.
Dynamic power consumption (DPC) of Zynq is now measured for each test
design. The ZedBoard has an on-board current-sense resistor of 10 mΩ in series with its 12 V power supply [14]. The voltage across this resistor is measured using an Agilent Technologies U3401A digital multimeter. The current consumed by the board and its total power consumption are calculated from this voltage. Subtracting the
static power of the board from the total power measured for a test case provides
the DPC of Zynq for that case. Since the test design uses only FPGA resources,
the calculated DPC corresponds to power consumed by Zynq alone.
Although the obtained DPC is due to both logic slices and BRAMs, the variation in DPC is due to changes in the number of logic slices. Hence, the obtained results
are plotted with respect to logic slices at every frequency, as shown in Fig. 2a.
From these plots, the linear equations representing the relation between power
consumption and number of logic slices are obtained at every frequency, and are
also shown in Fig. 2a. It can be observed that the linear coefficients are dependent
on frequency. In addition, the constant term in the equations also increases with frequency. This can further be split into a fixed constant and
a frequency dependent offset, a portion of which is due to the 10 BRAMs in
the design. Thus, the set of equations in Fig. 2a can be summarized into one equation involving all parameters, namely frequency, logic slices and BRAMs. The following general equation represents the model for the DPC of any FPGA:

$$\mathrm{DPC_{FPGA}}\;(\mathrm{mW}) = \frac{F_{cc}}{F_{min}} \times \left\{ C_{LS} \times N_{LS} + C_B \times N_B + C_F \right\} + B \qquad (1)$$

In (1), Fcc is the current operating clock frequency and Fmin is the minimum operating frequency for the applications. CLS, CB and CF are coefficients representing the contributions of the logic slices, the BRAMs and a frequency-dependent constant term, respectively, and B is a constant offset.
From Fig. 2a, in the case of the Zynq, the dynamic power consumption is:

$$\mathrm{DPC_{FPGA}}\;(\mathrm{mW}) = \frac{F_{cc}\,(\mathrm{MHz})}{37.125} \times \left\{ 0.006 \times N_{LS} + 30 \right\} + 60 \qquad (2)$$

In (2), CLS = 0.006, B = 60, NB = 10 and 10 × CB + CF = 30. The individual values of CB and CF will be determined in the next step.
Step 2 - Isolate Behavior of BRAM Slices: In this step, both BRAM and logic slices are increased simultaneously, as opposed to increasing BRAM slices alone, since increasing the BRAMs in a design increases the logic slices as well. For
our set of experiments, 5 test cases are generated by varying logic and BRAM
utilization on Zynq from 2900 slices and 20 BRAMs to 13100 slices and 120
BRAMs in 5 steps. Frequency is again varied in multiples of 37.125 MHz up to
185.625 MHz, thus generating a total of 25 test cases. DPC of Zynq is again
obtained for each test case. Substituting the number of slices used in each test case of this step into (2), and subtracting the calculated DPC values from the values measured in this step, we obtain the DPC of the BRAMs alone. These values are
plotted with respect to BRAM slices at different frequencies, as shown in Fig. 2b.
From the set of equations in Fig. 2b, it is observed that the linear coefficients in
this case too depend on frequency. Thus, the set of equations in Fig. 2b can be
summarized into the following general equation which represents BRAM DPC.
$$\mathrm{DPC_{BRAM}}\;(\mathrm{mW}) = \frac{F_{cc}}{F_{min}} \times \left\{ C_B \times N_B \right\} \qquad (3)$$
From Fig. 2b, in the case of the Zynq, the dynamic power consumption of the BRAMs is:

$$\mathrm{DPC_{BRAM}}\;(\mathrm{mW}) = \frac{F_{cc}\,(\mathrm{MHz})}{37.125} \times \left\{ 1.2 \times N_B \right\} \qquad (4)$$
Step 3 - Complete the Model Equation: From (4), CB = 1.2. Substituting CB into 10 × CB + CF = 30, obtained from (2), we get CF = 18. Thus, the
combined model for power consumption due to logic slices, BRAM slices and
frequency can be summarized for Zynq XC7Z020 as:
$$\mathrm{DPC_{FPGA}}\;(\mathrm{mW}) = \frac{F_{cc}\,(\mathrm{MHz})}{37.125} \times \left\{ 0.006 \times N_{LS} + 1.2 \times N_B + 18 \right\} + 60 \qquad (5)$$
The first and second terms in (5) represent the individual effect of slices and
BRAMs on power consumption. The third term is a frequency dependent con-
stant and the last term is a constant offset. Using (5) and comparing the esti-
mated results with the measured results, the maximum difference between the
two is 30 mW. Thus, the model accurately represents the true DPC of Zynq
XC7Z020. The same procedure can be followed to incorporate IOBs and DSP
slices and also to derive a model for any FPGA.
From the derived PCE model, the following can be analyzed: Since the coef-
ficient for logic slices is very small, the power consumed by the slices can be
considered as a negligible constant, especially at low frequencies, for further
simplification of the model. The BRAMs, on the other hand, have 200 times (1.2/0.006 = 200) the per-unit impact of logic slices, which is a significant contribution to dynamic power. Furthermore, reducing the frequency by a certain factor reduces power more than reducing resource utilization by the same factor. This means that if resource utilization is doubled and the frequency is halved, the power consumption decreases instead of staying the same.
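A minimal sketch of Eq. (5) in C is given below, using the coefficients extracted above. The two sample points are taken from the task-variant table later in this section (variant T1-3 at 148.5 MHz, and variant T1-2 with roughly doubled resources at half the frequency); the model reproduces the table values (≈281 mW and ≈244 mW) and the trade-off just described.

#include <stdio.h>

/* Dynamic power model of Eq. (5) for the Zynq XC7Z020 (result in mW).
 * f_cc in MHz, n_ls = number of logic slices, n_b = number of BRAMs. */
static double dpc_xc7z020_mw(double f_cc, int n_ls, int n_b)
{
    return (f_cc / 37.125) * (0.006 * n_ls + 1.2 * n_b + 18.0) + 60.0;
}

int main(void)
{
    /* Trade-off noted in the text: doubling resources while halving the
     * frequency lowers dynamic power rather than keeping it constant. */
    printf("T1-3: 2200 slices, 20 BRAMs @148.5 MHz -> %.1f mW\n",
           dpc_xc7z020_mw(148.5, 2200, 20));
    printf("T1-2: 4312 slices, 40 BRAMs @74.25 MHz -> %.1f mW\n",
           dpc_xc7z020_mw(74.25, 4312, 40));
    return 0;
}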
The process from measurements for all test cases up to model derivation took
around 8 h. Use of predictions from Xilinx Power Analyser (XPA) instead of
measurements was also attempted. Default switching activity values resulted in
a slope for increase in power due to increase in frequency, which was 60% higher
than that from measurements. Generating a SAIF file for an accurate activity factor took around one day of simulation per test case and hence was avoided. Using trial and error, when close-to-accurate activity factors were fed into XPA, the predicted slope of the power increase with frequency matched that obtained from measurements. It can be concluded that XPA can be used if accurate activity factors are available; otherwise, actual measurement is the most accurate and fastest method to obtain FPGA power consumption.
Variant no. | No. of slots | Fcc (MHz) | Performance | No. of slices | No. of BRAMs | Power (mW)
T1-1 | 4 | 37.125 | 400 MBps | 8591 | 80  | 226
T1-2 | 2 | 74.25  | 400 MBps | 4312 | 40  | 244
T1-3 | 1 | 148.5  | 400 MBps | 2200 | 20  | 281
T2-1 | 1 | 37.125 | 30 fps   | 1504 | 15  | 105
T2-2 | 2 | 37.125 | 60 fps   | 2950 | 30  | 132
T2-3 | 1 | 74.25  | 60 fps   | 1504 | 15  | 150
T2-4 | 4 | 37.125 | 120 fps  | 5853 | 60  | 185
T2-5 | 2 | 74.25  | 120 fps  | 2950 | 30  | 203
T2-6 | 1 | 148.5  | 120 fps  | 1504 | 15  | 240
T3-1 | 1 | 37.125 | 4 Mbps   | 2028 | 33  | 130
T3-2 | 2 | 37.125 | 8 Mbps   | 3960 | 66  | 181
T3-3 | 1 | 74.25  | 8 Mbps   | 2028 | 33  | 200
T3-4 | 4 | 37.125 | 16 Mbps  | 8046 | 132 | 285
T3-5 | 2 | 74.25  | 16 Mbps  | 3960 | 66  | 302
T3-6 | 1 | 148.5  | 16 Mbps  | 2028 | 33  | 339
Case 1: When the battery is fully charged at 100% capacity, variants T1-3, T2-6 and T3-6 are configured. They occupy one slot each and run at Fcc = 148.5 MHz at their maximum performance of 400 MBps, 120 fps and 16 Mbps, respectively. Thus, as seen in Fig. 1a, three slots can be used as spare resources
for adaptation. From Table 1, DPC of the FPGA due to the three active tasks is
equal to 860 mW. Adding the static power of 2200 mW, the total system power
consumption is 3060 mW. Current consumption is therefore 255 mA, making the
system sustainable up to 7.8 h.
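For reference, the Case 1 numbers can be reproduced with a few lines of arithmetic. The 12 V supply follows from the ZedBoard measurement setup described earlier; the battery capacity of roughly 2000 mAh is an assumption inferred from the quoted 7.8 h figure, not a value stated in the text.

#include <stdio.h>

int main(void)
{
    const double dpc_mw      = 281.0 + 240.0 + 339.0; /* T1-3 + T2-6 + T3-6 (table above) */
    const double static_mw   = 2200.0;                /* static system power              */
    const double supply_v    = 12.0;                  /* ZedBoard 12 V supply             */
    const double battery_mah = 2000.0;                /* assumed capacity, inferred from
                                                         the 7.8 h figure                 */

    double total_mw   = dpc_mw + static_mw;           /* 3060 mW */
    double current_ma = total_mw / supply_v;           /* 255 mA  */
    double hours      = battery_mah / current_ma;      /* ~7.8 h  */

    printf("total = %.0f mW, current = %.0f mA, lifetime = %.1f h\n",
           total_mw, current_ma, hours);
    return 0;
}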
Case 2: At the end of one hour, the battery capacity has reduced by 255 mAh to around 87% of its capacity, as shown in Fig. 3b. The system power budget shows that it can operate for another 6.8 h. If the battery is recharged later than this, the system will shut down. Suppose that the predicted time for battery recharge is 7 h. To adapt to the situation, the system, as shown in Fig. 1b, dynamically reconfigures variants T1-2, T2-5 and T3-5, all of which occupy two slots and operate at 74.25 MHz, maintaining their maximum performance. This adaptation of the SoPC architecture extends the active time without performance degradation. A calculation as above gives the current consumption as 245.75 mA. The system can now work for 7 h and 6 min and thus prevent its shutdown before the battery can begin recharging.
Case 3: After another hour, battery capacity is depleted to around 75% of
its capacity, as shown in Fig. 3b. Based on system power budget, it can func-
tion up to 6.1 h. However, due to external conditions, the battery can now begin recharging only after 6.7 h, i.e., 42 min later than the expected 6 h. The system should now dynamically reconfigure variants T1-1, T2-1 and T3-1 as shown
in Fig. 1c, where Fcc = 37.125 MHz. ASP T1-1 occupies 4 slots to maintain
its required performance. ASPs T2-1 and T3-1 occupy one slot each to pro-
vide a reduced performance of 30 fps and 4 Mbps respectively. The system now
consumes a current of 221.69 mA to increase its active time by 40 min, again
preventing shut down before the battery can begin re-charging.
If the system had continued at the initial 255 mA, it would function up to
7.8 h as shown in Fig. 3a. However, due to adaptation, the system sustains itself
at the desired performance of T1 for around one hour more, as seen in Fig. 3b, while simultaneously preventing system shutdown. This example thus demonstrates the system's ability to adapt to a reduced power budget without performance degradation as long as reserved resources can compensate for the lack of power. Also, when all resources are already in use, a further reduction of the power budget can be compensated by reducing the performance of non-critical tasks.
Configuring Zynq with a full bit-stream using JTAG consumes only 100 µW-
h of energy. A dynamic reconfiguration cycle using partial bit-streams over the PCAP/ICAP port would consume much less energy and hence can be neglected compared to the energy consumed by the ASP circuits during their execution.
6 Conclusion
A PCE-model-based architecture reconfiguration approach is presented for FPGA-centric systems such that they can adapt to a changing power budget and continue executing their critical tasks at the required performance for an extended time. The method for deriving the PCE model is presented using the Xilinx Zynq XC7Z020 FPGA as an experimental platform but can be applied to any FPGA. An example of an FPGA-centric system was analyzed in light of sustaining its crit-
ical task performance with or without performance degradation of non-critical
tasks according to available power. It was shown that the proposed approach
also prevents system shut down before battery re-charge. The benefits of the
References
1. Dumitriu, V., Kirischian, L., Kirischian, V.: Mitigation of variations in environmen-
tal conditions by SoPC architecture adaptation. In: 2015 NASA/ESA Conference
on Adaptive Hardware and Systems (AHS), pp. 1–8, June 2015
2. Dumitriu, V., Kirischian, L.: SoPC self-integration mechanism for seamless archi-
tecture adaptation to stream workload variations. IEEE Trans. Very Large Scale
Integr. (VLSI) Syst. 24(2), 799–802 (2016)
3. Dumitriu, V., Kirischian, L., Kirischian, V.: Run-time recovery mechanism for tran-
sient and permanent hardware faults based on distributed, self-organized dynamic
partially reconfigurable systems. IEEE Trans. Comput. 65(9), 2835–2847 (2016)
4. Tabkhi, H., Schirner, G.: Application-guided power gating reducing register
file static power. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 22(12),
2513–2526 (2014)
5. Hosseinabady, M., Nunez-Yanez, J.L.: Run-time power gating in hybrid ARM-FPGA
devices. In: 2014 24th International Conference on Field Programmable Logic and
Applications (FPL), pp. 1–6, September 2014
6. You, D., Chung, K.S.: Quality of service-aware dynamic voltage and frequency
scaling for embedded GPUs. IEEE Comput. Archit. Lett. 14(1), 66–69 (2015)
7. Khan, M.U.K., Shafique, M., Henkel, J.: Power-efficient workload balancing for
video applications. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 24(6),
2089–2102 (2016)
8. Rodrı́guez, A., Valverde, J., Castañares, C., Portilla, J., de la Torre, E.,
Riesgo, T.: Execution modeling in self-aware FPGA-based architectures for effi-
cient resource management. In: 10th International Symposium on Reconfigurable
Communication-centric Systems-on-Chip (ReCoSoC), pp. 1–8, June 2015
9. Ali, K.M.A., Atitallah, R.B., Fakhfakh, N., Dekeyser, J.L.: Using hardware paral-
lelism for reducing power consumption in video streaming applications. In: 10th
International Symposium on Reconfigurable Communication-centric Systems-on-
Chip (ReCoSoC), pp. 1–7, June 2015
10. Jovanovic, B., Jevtic, R., Carreras, C.: Binary division power models for high-level
power estimation of FPGA-based DSP circuits. IEEE Trans. Industr. Inf. 10(1),
393–398 (2014)
11. Jovanovic, B., Jevtic, R., Carreras, C.: Triple-bit method for power estimation of
nonlinear digital circuits in FPGAs. Electron. Lett. 46(13), 903–905 (2010)
12. Senn, L., Senn, E., Samoyeau, C.: Modelling the power and energy consumption of
NIOS II softcores on FPGA. In: 2012 IEEE International Conference on Cluster
Computing Workshops, pp. 179–183, September 2012
13. Xilinx: Zynq-7000 All Programmable SoC Overview v1.10, September 2016
14. Xilinx: ZedBoard Hardware User’s Guide v2.2, January 2014
Fault Tolerance
Exploring Performance Overhead Versus Soft
Error Detection in Lockstep Dual-Core ARM
Cortex-A9 Processor Embedded into Xilinx
Zynq APSoC
1 Introduction
radiation effects, as they are composed of millions of SRAM cells used to config-
ure all the synthesized logic, the embedded processors, DSPs, and memories [1].
Embedded systems operating in aerospace applications are especially suscepti-
ble to radiation effects caused by ionized particles. Systems in avionics and at
ground level can also be affected by radiation-induced soft errors due to inter-
action with neutron particles present in the atmosphere [2]. These particles can
interact with silicon, provoking transient pulses in some susceptible areas. Such
episodes might lead to Single Event Upset (SEU) – or bit flips – in the sequen-
tial logic that could induce errors, generating Silent Data Corruption (SDC) and
other failures in the system, like hangs and crashes [3].
In this work, we developed an approach based on the dual-core lockstep (DCLS) technique to improve the dependability of the embedded dual-core ARM Cortex-A9 processor of the Xilinx Zynq-7000 APSoC, and analyzed different setups in terms of performance, execution time overhead, and soft error recovery. The proposed DCLS architecture relies on two ARM cores running with independent embedded BRAM memories to duplicate the application execution, and a checker module to validate the processors' outputs and, in case of failure, roll back the application. The novelty lies in applying DCLS to a hard-core ARM Cortex-A9; to the best of our knowledge, no previous work has focused on this processor. In addition, an exclusive BRAM memory is used for each processor in order to avoid a single point of failure in the data memory, thereby increasing reliability. Results show that the overhead in the execution time strongly depends on the number of checkpoints and on the relation between the application size and the size of the checkpoint and rollback routines. A fault injection method was developed to emulate soft errors in the dual-core processor and to validate the DCLS approach. Experiments indicate that the proposed DCLS approach for the dual-core ARM-A9 is able to mitigate around 91% of the bit flips injected in the ARM register file. Moreover, the proposed DCLS approach is extendable to other APSoC devices, such as the Xilinx Zynq UltraScale+ and the Intel Cyclone V.
2 Related Works
There are many fault-tolerance techniques to improve the dependability of
processors. They can be classified as hardware-based, software-based and hybrid
techniques [3]. DCLS is a hybrid fault-tolerance technique based on hardware
and software redundancy for error detection and correction. It combines checkpointing with a rollback mechanism at the software level, and processor duplication with checker circuits at the hardware level. The DCLS works by executing
the same application simultaneously and symmetrically in two identical proces-
sors, which are initialized to the same state with identical inputs (code, bus
operations and asynchronous events) during system start-up. So, during normal
operation the state of the two processors is identical from clock to clock. The
DCLS technique assumes that an error in either processor will cause a difference
between the states, which will eventually be manifested as a difference in the
outputs. Thus, the DCLS system monitors the outputs of both processors and
persistent errors in the embedded processors caused by bit flips in the program-
mable configurable blocks as occur with soft-core processors.
Fig. 2. Lockstep functional flow for ARM Cortex-A9 dual-core: (a) original code, (b)
code with lockstep technique running in both CPUs and (c) the checker functionality.
CPU is in user mode, although there are system modes that could access some of these registers to store instructions or system information. Finally, ARM deprecates the use of the special registers (the stack pointer (SP), link register (LR) and program counter (PC)) for any purpose other than that for which they are specified; incorrect handling of these registers could lead the system to unpredictable behavior.
To guarantee that the application does not become stuck in a code block execution due to a fault in the system, a watchdog timer is configured with twice the time required to run each code block. Therefore, if a CPU does not reach the verification point before the watchdog timer expires, this is considered a system inconsistency and the Checker triggers the rollback mechanism.
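A minimal sketch of the resulting per-block control flow is given below, assuming hypothetical helper routines for the checkpoint, checker and watchdog; it is not the authors' actual code, only an illustration of the mechanism described above.

#include <stdbool.h>

/* Placeholder hooks: stand-ins for the platform-specific routines that a real
 * DCLS implementation would provide (not the authors' actual API). */
extern void watchdog_start(unsigned ticks);
extern void watchdog_stop(void);
extern void save_context(void);                 /* checkpoint register context   */
extern void execute_block(int block_id);        /* run one code block            */
extern bool checker_outputs_match(void);        /* hardware Checker comparison   */
extern void rollback_to_last_checkpoint(void);  /* restore last known-good state */

/* One protected code block: watchdog set to twice the expected run time. */
static void run_code_block(int block_id, unsigned block_ticks)
{
    watchdog_start(2u * block_ticks);

    save_context();                    /* checkpoint before the block           */
    execute_block(block_id);           /* both CPUs execute the block in lockstep */

    /* Verification point: the Checker compares the outputs of CPU0 and CPU1. */
    if (!checker_outputs_match()) {
        rollback_to_last_checkpoint(); /* restore the last known-good context   */
        execute_block(block_id);       /* re-execute the faulty block           */
    }
    watchdog_stop();
    /* If the watchdog expires before this point, the Checker treats it as a
     * system inconsistency and itself triggers the rollback mechanism. */
}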
Fig. 3. Fault injection: (a) experiment setup and (b) procedure flow.
5 Experimental Results
We selected applications based on a set of matrix multiplication operations as the benchmark to evaluate the proposed DCLS on the ARM-A9. Both CPUs execute the bare-metal application in a hand-shake fashion. Each full matrix multiplication operation is considered one code block, as defined in Fig. 2(b). Each operation multiplies different input matrices, containing 32-bit data, which are stored in BRAM memory. As the objective is to investigate the effect of the code block size and the number of code blocks in the application on performance and soft error correction, we consider applications with five matrix sizes (3 × 3, 10 × 10, 20 × 20, 30 × 30 and 40 × 40) composed of either three (short application) or six (long application) code blocks. To validate the proposed DCLS, the DCLS BR and DCLS BR DDR setups were used, together with the UNHARDENED version, which is not protected against soft errors and runs its applications only on CPU0. The CPUs' L1 and L2 caches are disabled, and each CPU is connected to a BRAM memory, which stores the application data, and to the external DDR memory, which stores the program instructions.
The versions of the application use the following ARM general-purpose registers: R0 to R5, R8, and R11, in addition to the special registers SP, LR and PC. This repre-
sents the usage of 68% of the register file. Although all versions were compiled
allocating the same registers, the exposure time and functionalities during the
execution time are different, which affects the reliability as verified in the fault
injection experiments. In all versions, UNHARDENED and DCLS setups, the
R0 to R3 and R11 registers are used in the matrix multiplications operations
and the R4 and R5 registers are used in the printf function to show the perfor-
mance information. Furthermore, the R0 to R3 and R11 registers are also used
to save and restore the context in the DCLS setups. To verify the flow execution
The area resources in terms of LUTs and flip-flops are detailed in Table 1. One can observe that the area overhead is around 280% for the logic implemented in the CLBs, which is consistent with related works [7,8]. For the processors and memories, the area overhead is 100%. Table 2 reports the performance results, comparing the execution times for different matrix sizes, distinct setups, and the two application sizes. The execution time obtained depends on several factors: the time required for both CPUs to execute the application in hand-shake; the application size; the number of checkpoints performed; the time needed to run the interrupt routine that implements the context saving, which is directly affected by the amount of data stored; and the execution time of the rollback operation (only in case of errors). As shown in Table 2, the performance overhead is significantly higher, around 425% for DCLS BR and 625% for DCLS BR DDR, when the execution time of the application is much smaller than the time required to perform the checkpoint and rollback routines. For large applications, the time overhead of DCLS BR is less than 25%, which can be an acceptable overhead in many applications that require high reliability and availability. When considering the DCLS BR DDR setup, the time overhead in all versions is higher compared to DCLS BR, as expected, due to the time to access the DDR memory.
In order to evaluate the impact of soft errors in a dual-core ARM processor and to validate the efficiency of the proposed DCLS approach, we ran an extensive fault injection campaign on the ZedBoard. We tested long and short applications
performing 3 × 3, 10 × 10 and 20 × 20 matrix multiplications in UNHARDENED,
DCLS BR and DCLS BR DDR setups. Table 2 shows the fault injection results.
For the UNHARDENED versions, up to 70% of the injected faults are UNACE.
For the DCLS setups, one can observe that the DCLS approach is able to recover
around 91% of the injected faults in the DCLS BR and 90.5% in DCLS BR DDR
Table 1. Area overhead analysis
Table 2. Performance overheads and fault injection analysis for each setup running
different matrix sizes
setup. Around 8% of the injected bit flips could not be recovered and provoked hangs in the DCLS system. This result can be explained by two facts. First, there are some registers (R0, R1, R11 and R12) that could not be protected by our solution because these registers have distinct values in CPU0 and CPU1 during normal program execution. Therefore, if a bit flip affects one of them during a code block execution and its effect is masked, and consequently does not affect the outputs, the Checker will not detect the error in those registers. This leads to storing the current context, with the wrong values, as a safe state. Thus, the fault can manifest itself in the next code block execution, leading to a rollback operation that restores the wrong context and thereby causes an infinite loop in the system. The hang or timeout can be identified, but can only be recovered by a reset. In addition, if a fault affects any of the special registers (SP, LR or PC), generating an illegal data or instruction value, the processor will be directed to a data or prefetch abort, leading to a system crash. Finally, the number of injected faults in our approach that produce SDCs (wrong output values) is negligible; even so, they can be explained by bit flips in the LR or PC registers that can direct the program pointer to the end of the application. Thus, when the output results are compared with the golden ones, they mismatch and an SDC is indicated.
6 Conclusion
References
1. Siegle, F., et al.: Mitigation of radiation effects in SRAM-based FPGAs for space
applications. ACM Comput. Surv. 47(2), 37:1–37:34 (2015)
2. Normand, E.: Correlation of inflight neutron dosimeter and SEU measurements with
atmospheric neutron model. IEEE Trans. Nucl. Sci. 48(6), 1996–2003 (2001)
3. Azambuja, J., et al.: Hybrid Fault Tolerance Techniques to Detect Transient Faults
in Embedded Processors. Springer International Publishing, Switzerland (2016)
4. Bowen, N.S., Pradhan, D.K.: Processor- and memory-based checkpoint and roll-
back recovery. Computer 26(2), 22–31 (1993)
5. Reorda, M.S., et al.: A low-cost SEE mitigation solution for soft-processors embed-
ded in systems on programmable chips. In: DATE, pp. 352–357, April 2009
6. Violante, M., et al.: A low-cost solution for deploying processor cores in harsh
environments. IEEE Trans. Ind. Electron. 58(7), 2617–2626 (2011)
7. Gomez-Cornejo, J., et al.: Fast context reloading lockstep approach for SEUs mit-
igation in a FPGA soft core processor. In: IECON, pp. 2261–2266, November 2013
8. Pham, H.M., et al.: Low-overhead fault-tolerance technique for a dynamically
reconfigurable softcore processor. IEEE Trans. Comput. 62(6), 1179–1192 (2013)
9. Abate, F., et al.: A new mitigation approach for soft errors in embedded processors.
IEEE Trans. Nucl. Sci. 55(4), 2063–2069 (2008)
10. Cortex-R5 and Cortex-R5F Technical Reference Manual. Rev: r1p1 (2010–2011)
11. Tambara, L.A., et al.: Analyzing the impact of radiation-induced failures in pro-
grammable SoCs. IEEE Trans. Nucl. Sci. 63(4), 2217–2224 (2016)
12. ARM R Architecture Reference Manual. ARMv7-A and ARMv7-R edition (2012)
13. Taylor, A.: How to use interrupts on the Zynq SoC. Xcell J. 87, 38–43 (2014)
14. Rezgui, S., et al.: Estimating error rates in processor-based architectures. IEEE
Trans. Nucl. Sci. 48(5), 1680–1687 (2001)
15. Velazco, R., et al.: Predicting error rate for microprocessor-based digital architec-
tures through C.E.U. (code emulating upsets) injection. IEEE Trans. Nucl. Sci.
47(6), 2405–2411 (2000)
16. Lins, F., et al.: Register file criticality on embedded microprocessor reliability. In:
Proceedings RADECS (2016)
Applying TMR in Hardware Accelerators
Generated by High-Level Synthesis Design
Flow for Mitigating Multiple Bit Upsets
in SRAM-Based FPGAs
1 Introduction
is no study that has investigated the use of TMR applied at the C language level, synthesized with HLS, and evaluated in SRAM-based FPGAs under SEUs.
The case-studied FPGA is a 28-nm Artix-7 FPGA from Xilinx. Different TMR
approaches were implemented in a matrix multiplication algorithm described in C
language connected to a soft-core Microblaze responsible for sending and receiving the
workload data stream. Bit-flips were injected into the FPGA bitstream by a fault
injection framework developed in our research group [5]. Several fault injection
campaigns were performed for all the designs in order to identify the error rate under
accumulated bit-flips. Results show that the TMR can mask multiple errors as expected,
but redundancy in the voters and in the interface is mandatory to increase reliability.
Results show that by using a coarse grain TMR with triplicated inputs, voters, and
outputs, it is possible to reach 95% of reliability by accumulating up to 61 bit-flips and
99% of reliability by accumulating up to 17 bit-flips in the configuration memory bits.
These numbers imply a Mean Time Between Failures (MTBF) of the coarse grain TMR at ground level that is 50% to 70% higher than the MTBF of the unhardened version for the same reliability confidence.
The concept of TMR is to have three identical copies processing data and a majority
voter voting their outputs to mask errors in one of the copies. TMR can be implemented
in hardware at gate level, for instance, where each module is triplicated and voters are
added, but it can also be implemented in software, where part of the code is triplicated
and its outputs are voted. According to the granularity of the TMR and the location of
the majority voters, there is the coarse grain TMR (CGTMR), in which voters are
placed only at the outputs of the design, and there is the fine grain TMR (FGTMR), in
which voters are placed at the outputs of all or selected flip-flops and/or combinational
logic, according to the design requirements. In this work, we are implementing TMR in
a piece of high-level code to generate a hardware block through HLS. Thus, after
synthesis, redundant hardware and majority voters are automatically generated. The
input/output interfaces can be triplicated or not. However, if the interface is not trip-
licated, single point of failures can be observed in the TMR design.
When describing an algorithm to be synthesized by an HLS tool, one can consider
that the algorithm source code is composed of operations, conditional statements,
loops, and functions. Therefore, TMR must be implemented in these code structures.
The question is how to triplicate all these structures to generate coarse or fine grain
TMR in an efficient way, ensuring that the redundant logic will not be removed and, at
the same time, being able to take advantage of some of the optimization strategies
usually provided by HLS tools.
By default, an HLS tool translates each high-level function call into an RTL block. As a consequence, if a function is called three times, three identical RTL blocks will be generated, and the HLS tool will assume that they can be executed in parallel if no data dependencies exist among them. Conversely, if we perform an operation three times in sequence inside the same function, the HLS tool will generate serial hardware in which each operation is executed sequentially, one at a time. With regard to the majority voters, since they are always implemented as a function call, they are always synthesized as independent RTL blocks. These are the main principles on which our investigation relies. Lastly, based on these approaches, one can observe that in a modularized design (parallel), the majority voters are placed separately from the TMR blocks, while in a non-modularized design (serial), the majority voters are placed together with the TMR circuitry. In this work, we investigate coarse grain TMR implemented in parallel, named CGPTMR.
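A small illustration of these two coding styles, under assumed function and array names, is sketched below; the first form corresponds to the modularized (parallel) description and the second to the non-modularized (serial) one.

#define N 36   /* e.g. a flattened 6x6 matrix, as in the case study */

/* Stand-in for the accelerated computation (one replica of the kernel). */
static void kernel(const int a[N], const int b[N], int y[N])
{
    for (int i = 0; i < N; i++)
        y[i] = a[i] * b[i];
}

/* Modularized form: three calls to the same function.  Vivado HLS creates
 * three RTL instances, which may execute in parallel when no data
 * dependencies exist among them. */
void tmr_modularized(const int a[N], const int b[N],
                     int y0[N], int y1[N], int y2[N])
{
    kernel(a, b, y0);
    kernel(a, b, y1);
    kernel(a, b, y2);
}

/* Non-modularized form: repeating the operation inside one function yields a
 * single datapath that performs the three executions sequentially. */
void tmr_non_modularized(const int a[N], const int b[N], int y[3][N])
{
    for (int r = 0; r < 3; r++)
        kernel(a, b, y[r]);
}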
For hardware accelerators, the interface to receive the workload data stream is very
important. In Xilinx devices, high-performance hardware accelerators are usually con-
nected to soft- or hard-core processors through a Direct Memory Access (DMA) inter-
face and Advanced eXtensible Interface Stream (AXI-S) ports. This interconnect
infrastructure provides a pipelined control that enables the software running on the processor to queue multiple task requests, reducing its latency. According to [9], each
accelerator operates as an independent thread, synchronized in hardware at the transport
level by AXI-S handshaking, with the input arrival and accelerator hardware
“start/done” synchronization barriers realized by the Stream interface of the DMA.
The architecture of the proposed evaluation setup is composed of the design generated by the HLS tool (here referred to as the Design Under Test - DUT), a MicroBlaze soft-core processor, which is a 32-bit, 5-stage pipeline Reduced Instruction Set Computer (RISC) soft processor, Advanced eXtensible Interface (AXI) units, memories (BRAM), a Direct Memory Access (DMA) unit, and the fault injector framework, as described in Fig. 1. Note that in Fig. 1(a) there is only one interface for communication, while in the setup of Fig. 1(b) the input and output interfaces are triplicated.
Fig. 1. Block diagram of the (a) CGPTMR SingleStream and (b) CGPTMR MultipleStream case-study designs connected to the MicroBlaze soft-core processor and fault injection framework.
Fig. 2. The three versions of the MxM implementation: unhardened with single stream (a), CGPTMR with single stream (b) and CGPTMR with multiple stream (c), represented by the number of steps to run the applications.
(Fig. 2(a)). In case of TMR, the redundancy can be implemented in parallel by trip-
licating the functions as represented in Fig. 2(b) and maintaining the single stream AXI
port interface. In this case, each function is triplicated and a single voter is placed at the
end of the code to vote out the data outputs. This scheme is named coarse grain parallel
TMR with single stream (CGPTMR SingleStream). The voters and interfaces can also
be triplicated, as shown in Fig. 2(c). This scheme is named coarse grain parallel TMR
with multiple stream (CGPTMR MultipleStream). In this work, we are exploring these
two implementations to analyze how area and performance overhead are impacted and
comparing also with the reliability of the TMR scheme. The resource allocation and
binding select the necessary and efficient RTL resources to implement behavioral
functionalities.
We selected the matrix multiplication (MxM) algorithm, shown in Fig. 3, to start our investigation, as this algorithm is rich in parallelism and loops. Each input matrix is a 6 × 6 8-bit array, generating a 6 × 6 16-bit output array. Three versions of the MxM algorithm were implemented and generated from the C source code using the Xilinx Vivado HLS tool: a TMR Coarse Grain Parallel version (CGPTMR) without optimization and with single-stream input and output data, a TMR Coarse Grain Parallel version (CGPTMR) without optimization and with multi-stream input and output data, and the unhardened version without optimization and with single-stream input and output data.
It is important to mention that for TMR implementations, it is not advised to use the
Vivado HLS optimization option named function inline, which optimizes designs for
area. Function inline removes the function hierarchy aiming to improve area by
allowing the components within the function to be better shared or optimized with the
logic in the calling function, which is something that is not recommended for redundant
circuits.
The CGPTMR version code is represented in Fig. 4 with a single stream and in Fig. 5 with multiple streams. Each function call is replicated, and optimizations performed on the function are extended to all the replicas. The majority voter votes on the data output bit by bit after the three redundant function calls. The status output is used to check, bit by bit, whether there is any difference among the three modules: a status equal to zero means that all bits match; otherwise the status is equal to one.
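Since Figs. 4 and 5 are not reproduced in this excerpt, the following C sketch illustrates the pattern under assumed names: the MxM function is called three times and a bit-by-bit majority voter produces both the voted output and the status flag described above.

#include <stdint.h>

#define DIM 6

/* 6x6 8-bit x 8-bit matrix multiplication producing 16-bit results
 * (stand-in for the case-study kernel). */
static void mxm(const uint8_t A[DIM][DIM], const uint8_t B[DIM][DIM],
                uint16_t C[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            uint16_t acc = 0;
            for (int k = 0; k < DIM; k++)
                acc += (uint16_t)A[i][k] * B[k][j];
            C[i][j] = acc;
        }
}

/* Bitwise majority vote of three redundant 16-bit results.  The status flag
 * becomes 1 if any bit differs among the three copies, 0 otherwise. */
static uint16_t vote16(uint16_t a, uint16_t b, uint16_t c, int *status)
{
    uint16_t voted = (a & b) | (a & c) | (b & c);
    *status |= ((a ^ b) | (a ^ c)) != 0;
    return voted;
}

/* Sketch of the CGPTMR top level: each function call is replicated and a
 * single voter runs after the three redundant calls (names are illustrative). */
void mxm_cgptmr(const uint8_t A[DIM][DIM], const uint8_t B[DIM][DIM],
                uint16_t C[DIM][DIM], int *status)
{
    uint16_t c0[DIM][DIM], c1[DIM][DIM], c2[DIM][DIM];

    mxm(A, B, c0);          /* redundant copy 0 */
    mxm(A, B, c1);          /* redundant copy 1 */
    mxm(A, B, c2);          /* redundant copy 2 */

    *status = 0;
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            C[i][j] = vote16(c0[i][j], c1[i][j], c2[i][j], status);
}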
Fig. 6. The fault injection methodology in (a) and the FPGA floorplanning of designs in (b).
The MicroBlaze is responsible for sending the input data as a data stream through the AXI connections, receiving the data output stream, and comparing the received values with the reference ones. The data is sent as 288 bits (a 6 × 6 8-bit matrix) through the AXI interface. The whole system runs at 100 MHz. The execution time of the MicroBlaze is around 175,727 clock cycles, which includes the time to send the control to the DUT, send the data inputs, wait for the DUT execution, read the data outputs, compare the values, and wait for the fault injection framework's next injection. For the DUT design, the execution time comprises the number of clock cycles for reading the input data, executing the matrix multiplication, voting, and writing the data output. As an example, for the CGPTMR SingleStream, 216 clock cycles are needed for reading the input data, 710 clock cycles to execute the HLS application, 156 clock cycles to execute the majority voter, 36 clock cycles to write the voter data, and 36 clock cycles to write the status data. The total time spent to perform all operations is 1,154 clock cycles.
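As a quick cross-check of these figures, the per-phase cycle counts indeed sum to the stated total, which at the 100 MHz system clock corresponds to roughly 11.5 µs per DUT invocation: 216 + 710 + 156 + 36 + 36 = 1,154 cycles, i.e. 1,154 × 10 ns ≈ 11.54 µs.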
4 Experimental Results
Table 1 presents the area resources and performance. The area can be evaluated by the
number of LUTs, flip-flops, and DSP blocks. One can notice that the TMR designs
present very similar areas. In this work, we mapped all the designs to the same target
area of 388 frames. The area overhead of the TMR designs is three times or more, as
expected. The maximum overhead is reached when the inputs and outputs AXI
interfaces are triplicated. In terms of performance, each TMR design presents a very
different execution time compared to the unhardened version. As explained, the exe-
cution time is calculated by the number of clock cycles needed to read the input
matrices, execute, vote, and write the output matrices. The performance overhead of the
TMR designs comes from the fact that the data input and data output is now triplicated
in time as well, and the voting phase also takes several clock cycles of the total
execution time, as shown in Fig. 2.
Accumulated SEUs were injected as described in Sect. 3. Although each design uses a different amount of resources, as detailed in Table 1, the fault injection campaigns considered the same injection area for all designs. Thus, we establish a similar condition for all designs, which emulates, for instance, the same particle fluence on their surface.
In this work, each DUT was implemented in a rectangular physical block of 388 configuration memory frames. Since a frame on a Xilinx Artix-7 FPGA has 3,232 bits, the total injection area comprises 1,254,016 bits. The number of essential bits is obtained from the Vivado Design Suite tool [3]. In this case, only the HLS accelerator design is considered under fault injection. In the fault injection campaigns, the number of SEUs injected was limited to 300 bits. Since the number of faults injected is small compared to the total number of configuration bits in the fault injection area, the likelihood of the same bit being hit more than once is small, allowing the error rate to be estimated as the average number of errors over the total injected faults. The average error rate for the different designs, along with its upper and lower quartiles, is presented in Fig. 7.
Fig. 7. Average error rate for the Unhardened, CGPTMR SingleStream and CGPTMR MultiStream designs.
A more detailed comparison of the designs can be seen in Fig. 8, where reliability
is presented as the complement of the cumulative failure distribution (R(t) = 1−F(t)).
The failure distribution F(t) of the system is the probability that one or more modules have failed by time t. In our case, Fig. 8 represents the reliability in terms of the accumulated bit-flips.
The inferiority of the CGPTMR SingleStream design, even when compared to the unhardened design, can be related to the amount of data that is serialized through the stream (see the CGPTMR SingleStream steps in Fig. 2(b)) and to the single point of failure in the DMA interconnection. Being a single point of failure, not only is a communication failure more likely due to the larger amount of data being serialized, but it also jeopardizes the effort placed on the TMR implementation. On the other hand, the CGPTMR MultipleStream clearly gives a reliability improvement over the unhardened design along the range of SEUs injected in this experiment. Nevertheless, as with any TMR implementation, there may be a crossing point further ahead in the reliability curves where the unhardened design performs better than the TMR implementation.
Even with these experiments limited to 300 injected SEUs, the expected exponential behavior of the reliability curves and the relationship among the reliabilities of the designs can be seen when we look at the same data in semi-log coordinates, as presented in Fig. 9. Two useful observations, which contribute to further engineering decisions, can be extracted from Fig. 9. First, if a recovery strategy, such as scrubbing or system reconfiguration by reset, is activated before the expected time at which approximately 10 SEUs have accumulated, then the power of TMR will not be exploited and no benefit is gained from its implementation in the system. Second, the trend suggests that the crossing from better TMR performance to better unhardened performance occurs somewhere between 300 and 1,000 accumulated SEUs, which defines the upper bound of the TMR benefit.
Considering the neutron flux at New York as a reference (13 n/cm²·h) [10] and the static neutron cross-section of Artix-7 FPGAs (7 × 10^-15 cm²/bit) [11], we can estimate the static neutron cross-section of the target area (388 frames × 3,232 bits = 1,254,016 bits), which is 8.78 × 10^-9 cm². The expression for the static neutron cross-section of the target area is:

$$\sigma_{\mathrm{target\ area}} = N_{\mathrm{bits}} \times \sigma_{\mathrm{bit}} = 1{,}254{,}016 \times 7 \times 10^{-15}\ \mathrm{cm^2} \approx 8.78 \times 10^{-9}\ \mathrm{cm^2}$$
Failure rate is the most common reliability metric. The failure rate itself is either time-dependent or time-independent [12]. The failure rate of the target area (1.14 × 10^-7 h^-1) is calculated by multiplying the static neutron cross-section by the neutron flux at New York, as follows:

$$\mathrm{Failure\ rate_{target\ area}} = \sigma_{\mathrm{target\ area}} \times \Phi_{\mathrm{NY}} = 8.78 \times 10^{-9}\ \mathrm{cm^2} \times 13\ \mathrm{n/(cm^2 \cdot h)} \approx 1.14 \times 10^{-7}\ \mathrm{h^{-1}}$$
Mean time between failures (MTBF) is defined as the average amount of time a device or product works before it fails. We calculate the MTBF of bit-flips (8.7 × 10^6 h) for the target area as follows:

$$\mathrm{MTBF_{target\ area}} = \frac{1}{\mathrm{Failure\ rate_{target\ area}}}$$
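The chain of calculations above can be reproduced with the short C program below; the constants are those quoted in the text. The final line additionally scales the target-area MTBF by the number of accumulated bit-flips tolerated at 99% reliability, which appears to be how the per-design MTBF values in Table 2 were obtained (an inference from the numbers, not a statement from the paper).

#include <stdio.h>

int main(void)
{
    const double bits_per_frame = 3232.0;
    const double frames         = 388.0;
    const double sigma_bit_cm2  = 7e-15;  /* static neutron cross-section per bit */
    const double ny_flux        = 13.0;   /* neutrons / (cm^2 * h) at New York    */

    double bits       = frames * bits_per_frame;   /* 1,254,016 bits */
    double sigma_area = bits * sigma_bit_cm2;       /* ~8.78e-9 cm^2  */
    double fail_rate  = sigma_area * ny_flux;       /* ~1.14e-7 /h    */
    double mtbf_h     = 1.0 / fail_rate;            /* ~8.7e6 h       */

    printf("sigma = %.2e cm^2, failure rate = %.2e /h, MTBF = %.1e h\n",
           sigma_area, fail_rate, mtbf_h);
    /* e.g. 17 accumulated bit-flips at 99%% reliability for the CGPTMR design */
    printf("MTBF_design (99%%, CGPTMR) ~ %.2e h\n", mtbf_h * 17.0);
    return 0;
}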
Table 2. Reliability of accumulated bit-flips and MTBF for the unhardened version and the CGPTMR MultiStream version

Reliability | Accumulated bit-flips (Unhardened) | Accumulated bit-flips (CGPTMR) | MTBF_design (Unhardened) | MTBF_design (CGPTMR)
99% | 10 | 17 | 8.7 × 10^7 h  | 1.48 × 10^8 h (+70%)
95% | 41 | 61 | 3.6 × 10^8 h  | 5.4 × 10^8 h (+50%)
90% | 55 | 87 | 4.8 × 10^8 h  | 7.6 × 10^8 h (+58%)
5 Conclusions
References
1. Carmichael, C.: Triple Module Redundancy Design Techniques for Virtex FPGAs. Xilinx,
Application Note XAPP197, July 2006
2. Habinc, S.: Functional Triple Modular Redundancy (FTMR). Gaisler Research, Design and
Assessment Report FPGA-003-01, December 2002
3. Xilinx: Vivado Design Suite - User Guide - High-Level Synthesis. UG902 (v2014.3), 1
October 2014
4. Tambara, L.A., Kastensmidt, F.L., Rech, P., Frost, C.: Decreasing FIT with diverse triple
modular redundancy in SRAM-based FPGAs. In: Proceedings of IEEE International
Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, pp. 1–6,
November 2014
5. Tonfat, J., Tambara, L., Santos, A., Kastensmidt, F.: Method to analyze the susceptibility of
HLS designs in SRAM-based FPGAs under soft errors. In: Bonato, V., Bouganis, C.,
Gorgon, M. (eds.) ARC 2016. LNCS, vol. 9625, pp. 132–143. Springer, Heidelberg (2016).
doi:10.1007/978-3-319-30481-6_11
6. Winterstein, F., Bayliss, S., Constantinides, G.A.: High-level synthesis of dynamic data
structures: a case study using Vivado HLS. In: Proceedings of International Conference on
Field-Programmable Technology, pp. 362–365, December 2013
7. Tambara, L.A., Tonfat, J., Santos, A., Lima Kastensmidt, F., Medina, N.H., Added, N.,
Aguiar, V.A.P., Aguirre, F., Silveira, M.A.G.: Analyzing reliability and performance trade-
offs of HLS-based designs in SRAM-based FPGAs under soft errors. IEEE Trans. Nucl.
Sci. 1, 1–8 (2017). ISSN: 0018-9499
8. Chan, X., Yang, W., Zhao, M., Wang, J.: HLS-based sensitivity-induced soft error
mitigation for satellite communication systems. In: 2016 IEEE 22nd International
Symposium on On-Line Testing and Robust System Design (IOLTS), pp. 143–148
(2016). ISSN 1942-9401
9. Xilinx: AXI4-Stream Accelerator Adapter v2.1. PG081 18 November 2015
10. JEDEC: Measurement and Reporting of Alpha Particle and Terrestrial Cosmic Ray-Induced
Soft Errors in Semiconductor Devices JEDEC Standard (2006). http://www.jedec.org/sites/
default/files/docs/jesd89a.pdf
11. Xilinx Inc.: Device Reliability Report, UG116 (v9.4) (2015)
12. Crowe, D., Feinberg, A.: Design for Reliability (2001). ISBN 13:978-1-4200-4084-5. https://
www.crcpress.com/
FPGA-Based Designs
FPGA Applications in Unmanned Aerial
Vehicles - A Review
1 Introduction
have been carried out to verify the versatility of the FPGAs in each real-time
part of an UAV system. As a result, some products have already appeared in
the market. This led us to take a look back to explore the suitability of FPGAs
in high-level control techniques (such stereo vision, Simultaneous Localization
and Mapping (SLAM) and path planning), as well as the low-level critical tasks
(such as stability, data acquisition and motor control). We also explore FPGA
usage in mission critical tasks such as object recognition and tracking. This
paper presents a wide study on FPGA applications in every task concerning on-
board processing of UAVs. We aim to provide researchers and designers signifi-
cant information which can help enhancing their design strategy. We structured
this paper as follows. The commercial products are studied in Sect. 2. Section 3
deals with high-level control techniques leveraging FPGA’s advantages. Then, in
Sect. 4, we present recent studies in which FPGAs are used in low-level control.
The use of FPGA in mission-critical tasks is presented in Sect. 5. Finally, we
conclude our work in Sect. 6 with a summary and future challenges.
(Figure: block diagram of a typical UAV flight control system, showing the remote control receiver and set point processing, the stability, navigation and height controllers, sensor value processing for gyroscopes, accelerometers, magnetometers, pressure sensor, GPS module and ultrasonic rangefinder, motor control calculation and drivers, and the SD-card logging and radio modules, with a hardware/software partition.)
3 High-Level Control
High-level control techniques refer to tasks that enable autonomous navigation. Data dependencies in Fig. 2 show that the navigation system is the first decision maker in the hierarchy, hence we consider them high-level control tasks. Deploying image processing and machine learning algorithms is common in such tasks. Moreover, these tasks can take advantage of parallelism in computation, which makes FPGAs and Application-Specific Integrated Circuits (ASICs) suitable candidates for hardware implementation.
3.2 SLAM
4 Low-Level Control
Sensor processing, state estimation, stability and motor control are the most safety-critical tasks in UAVs, as they are used in all existing platforms and perform the basic processing. Their role is to send data to and receive decisions from high-level control algorithms like the path planner, hence we group them as low-level control tasks. In the literature, the use of FPGAs has also been significant for these tasks.
The role of stability systems is to maintain a desired state of the UAV via a number of controllers. Each controller is responsible for one of the angles (roll, pitch, yaw), the angular velocities, or the altitude. Although the use of FPGA-based controllers is widespread in industrial applications [23], only a few researchers have used FPGAs in this specific area. A custom FPGA hardware implementation of a proportional-integral controller for the rotation axes of a small-scale quadrotor is proposed in [24]. It outperforms the software approach using an ARM7 microcontroller, achieving a 4.3 MHz control loop rate compared to 0.71 MHz in software. The same group implemented a proportional-integral-derivative controller on a Zynq FPGA controlling a micro-UAV, while using a HW/SW approach for their motion planning algorithm [25]. In mixed-criticality systems, a good hardware separation between critical and non-critical tasks is necessary. In [26], a Zynq-7000 was used as the hardware of a multi-rotor. Safety-critical tasks, including the stability system, were implemented in the PL using two MicroBlaze processors, while mission-critical tasks were implemented in the Processing System (PS). For the good functioning of UAVs in unknown dynamic environments, the
State estimation is a very important task in UAV systems, for which Kalman filters are the most commonly used approach [28]. The literature provides several application-specific implementations of the Kalman filter in FPGAs. For more general purposes, Soh and Wu [29] proposed a HW/SW co-design of an Unscented Kalman Filter. The algorithm was divided in such a way that the hardware part is application-independent. The authors reported results for different scenarios according to the number of Processing Elements (PEs) (1, 2, 5 and 10). The algorithm was implemented on a Xilinx Zynq-7000 series XC7Z045, showing a speed improvement of over 2× compared to a software approach while consuming less energy (131 mW using only a single PE; increasing the number of PEs increases speed at the expense of energy consumption).
Small UAVs mostly use brushless DC motors. FPGAs have been used in previous
research to control such motors in two ways. First, the FPGA board generates
PWM signals for COTS Electronic Speed Controllers, which are composed of a
microcontroller running specific control algorithms and a power circuit. This
technique was used in [4,26]. Second, the FPGA both generates the PWM signals
and runs the control algorithms, with only the power circuit kept external [33].
Essentially, these algorithms require the speed and position of the rotor, and
different techniques have been used to obtain them, with or without sensors [34,35].
Another interesting approach was proposed in [36]. The authors deployed dynamic
partial reconfiguration (DPR) to implement an adaptive controller that switches
between multiple flight modes of an octocopter. Because the UAV used can operate
with 3, 4, 6, or 8 motors, they designed customized PWM generator modules for
each mode and then used DPR to switch between them. The whole stability system
was implemented on a Zynq-7000 platform using a HW/SW co-design.
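As a rough illustration of the first technique (FPGA-generated PWM driving an
external speed controller), the following Python sketch maps a normalized throttle
command onto the compare value of a free-running PWM counter. The counter period
and the mapping are assumptions for illustration only, not values taken from the
cited designs.

    # Map a normalized throttle command to a PWM compare value (illustrative).
    PWM_PERIOD_TICKS = 2500   # assumed counter period, e.g. 125 MHz / 2500 = 50 kHz

    def throttle_to_compare(throttle):
        """throttle in [0.0, 1.0] -> number of ticks the output stays high."""
        throttle = min(max(throttle, 0.0), 1.0)   # clamp to the valid range
        return int(round(throttle * PWM_PERIOD_TICKS))

    def pwm_level(tick, compare):
        """Output level for a free-running counter value 'tick'."""
        return 1 if (tick % PWM_PERIOD_TICKS) < compare else 0

    compare = throttle_to_compare(0.35)
    waveform = [pwm_level(t, compare) for t in range(PWM_PERIOD_TICKS)]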
5 Mission-Critical Tasks
Other than essential algorithms for UAV’s FCS, many works have been done
for other mission-dependent processing. The use of FPGAs was significant in
these research area, especially on obstacle avoidance, object recognition and
communications.
References
1. Shamani, F., Sevom, V.F., Nurmi, J., Ahonen, T.: Design, implementation and
analysis of a run-time configurable memory management unit on FPGA. In: Nordic
Circuits and Systems Conference (NORCAS): NORCHIP & International Sympo-
sium on System-on-Chip (SoC), Oslo, pp. 1–8 (2015)
2. Rodriguez-Andina, J.J., Valdes-Pena, M.D., Moure, M.J.: Advanced features and
industrial applications of FPGAs – a review. IEEE Trans. Ind. Inform. 11(4),
853–864 (2015)
3. HiSystems GmbH: MikroKopter-Boards. http://wiki.mikrokopter.de/en/MK-
Board
4. Konomura, R., Hori, K.: Phenox: Zynq 7000 based quadcopter robot. In: International
Conference on ReConFigurable Computing and FPGAs (ReConFig), pp. 1–6
(2014)
5. Phenox lab. http://phenoxlab.com/
6. Aerotenna. http://aerotenna.com/ocpoc/
7. Aerotenna User and Developer Hub. https://aerotenna.readme.io
8. Nikolic, J., Rehder, J., Burri, M., Gohl, P., Leutenegger, S., Furgale, P.T.,
Siegwart, R.: A synchronized visual-inertial sensor system with FPGA pre-
processing for accurate real-time SLAM. In: IEEE International Conference on
Robotics and Automation (ICRA), pp. 431–437 (2014)
9. Boikos, K., Bouganis, C.S.: Semi-dense SLAM on an FPGA SoC. In: 26th Interna-
tional Conference on Field Programmable Logic and Applications (FPL), pp. 1–4
(2016)
10. Honegger, D., Oleynikova, H., Pollefeys, M.: Real-time and low latency embedded
computer vision hardware based on a combination of FPGA and mobile CPU.
In: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp.
4930–4935 (2014)
11. Oleynikova, H., Honegger, D., Pollefeys, M.: Reactive avoidance using embedded
stereo vision for MAV flight. In: IEEE International Conference on Robotics and
Automation (ICRA), pp. 50–56 (2015)
12. Barry, A.J., Oleynikova, H., Honegger, D., Pollefeys, M., Tedrake, R.: FPGA vs.
pushbroom stereo vision for MAVs. In: Vision-Based Control and Navigation of
Small Lightweight UAVs, IROS Workshop (2015)
13. Allaire, F.C.J., Tarbouchi, M., Labonté, G., Fusina, G.: FPGA implementation of
genetic algorithm for UAV real-time path planning. J. Intell. Robot. Syst. 54(1–3),
495–510 (2008)
14. Kok, J., Gonzalez, L.F., Kelson, N.: FPGA implementation of an evolutionary
algorithm for autonomous unmanned aerial vehicle on-board path planning. IEEE
Trans. Evol. Comput. 17(2), 272–281 (2013)
15. Schmid, K., Tomic, T., Ruess, F., Hirschmüller, H., Suppa, M.: Stereo vision based
indoor/outdoor navigation for flying robots. In: IEEE/RSJ International Confer-
ence on Intelligent Robots and Systems, pp. 3955–3962 (2013)
16. Angelopoulou, M.E., Bouganis, C.S.: Vision-based ego-motion estimation on
FPGA for unmanned aerial vehicle navigation. IEEE Trans. Circuits Syst. Video
Technol. 24(6), 1070–1083 (2014)
17. Ulusel, O., Picardo, C., Harris, C.B., Reda, S., Bahar, R.I.: Hardware acceleration
of feature detection and description algorithms on low-power embedded platforms.
In: 26th International Conference on Field Programmable Logic and Applications
(FPL), pp. 1–9 (2016)
18. Shamani, F., Airoldi, R., Ahonen, T., Nurmi, J.: FPGA implementation of a flex-
ible synchronizer for cognitive radio applications. In: Conference on Design and
Architectures for Signal and Image Processing (DASIP), Madrid, pp. 1–8 (2014)
19. Shamani, F., et al.: FPGA implementation issues of a flexible synchronizer suitable
for NC-OFDM-based cognitive radios. J. Syst. Architect. (2016)
20. Carlo, S.D., Gambardella, G., Prinetto, P., Rolfo, D., Trotta, P.: SA-FEMIP: A self-
adaptive features extractor and matcher IP-Core based on partially reconfigurable
FPGAs for space applications. IEEE Trans. Very Large Scale Integr. (VLSI) Syst.
23(10), 2198–2208 (2015)
21. Van der Wal, G., Zhang, D., Kandaswamy, I., Marakowitz, J., Kaighn, K.,
Zhang, J., Chai, S.: FPGA acceleration for feature based processing applications.
In: IEEE Conference on Computer Vision and Pattern Recognition Workshops
(CVPRW), pp. 42–47 (2015)
22. Fowers, S.G., Lee, D.J., Ventura, D.A., Archibald, J.K.: The nature-inspired BASIS
feature descriptor for UAV imagery and its hardware implementation. IEEE Trans.
Circuits Syst. Video Technol. 23(5), 756–768 (2013)
23. Monmasson, E., Idkhajine, L., Cirstea, M.N., Bahri, I., Tisan, A., Naouar, M.W.:
FPGAs in industrial control applications. IEEE Trans. Ind. Inform. 7(2), 224–243
(2011)
24. Eizad, B., Doshi, A., Postula, A.: FPGA based stability system for a small-scale
quadrotor unmanned aerial vehicle. In: Proceedings of the 8th FPGA World Con-
ference, NY, USA, pp. 3:1–3:6 (2011)
25. Doshi, A.A., Postula, A.J., Fletcher, A., Singh, S.P.N.: Development of micro-UAV
with integrated motion planning for open-cut mining surveillance. Microprocess.
Microsyst. 39(8), 829–835 (2015)
26. Schlender, H., Schreiner, S., Metzdorf, M., Grüttner, K., Nebel, W.: Teaching
mixed-criticality: multi-rotor flight control and payload processing on a single chip.
In: Proceedings of the WESE: Workshop on Embedded and Cyber-Physical Sys-
tems Education, NY, USA, pp. 9:1–9:8 (2015)
27. Fowers, S.G., Lee, D.J., Tippetts, B.J., Lillywhite, K.D., Dennis, A.W.,
Archibald, J.K.: Vision aided stabilization and the development of a quad-rotor
micro UAV. In: International Symposium on Computational Intelligence in Robot-
ics and Automation, pp. 143–148 (2007)
28. Chen, S.Y.: Kalman filter for robot vision: a survey. IEEE Trans. Ind. Electron.
59(11), 4409–4420 (2012)
29. Soh, J., Wu, X.: A FPGA-based unscented Kalman filter for system-on-chip appli-
cations. IEEE Trans. Circuits Syst. II Express Briefs PP(99), 1 (2016)
30. Christopherson, H., Pickell, W., Koller, A., Kannan, S., Johnson, E.: Small adap-
tive flight control systems for UAVs using FPGA/DSP technology. In: Proceeding
of AIAA “Unmanned Unlimited”, Technical Conference, Workshop and Exhibit
(2004)
31. Wang, X., Li, B., Yang, L., Huang, L., Wang, S.: A prototype of MEMS gyroscope
based on digital control. In: International Conference on Automatic Control and
Artificial Intelligence (ACAI), pp. 275–278 (2012)
32. Bai, C., Zhang, Z., Han, X.: A design and realization of FPGA-based IMU data
acquisition system. In: International Conference of Electron Devices and Solid-
State Circuits (EDSSC), pp. 1–2 (2011)
33. Tefay, B., Eizad, B., Crosthwaite, P., Singh, S., Postula, A.: Design of an integrated
electronic speed controller for agile robotic vehicles. Presented at the Australasian
Conference on Robotics and Automation (ACRA), pp. 1–8 (2011)
34. Sathyan, A., Milivojevic, N., Lee, Y.J., Krishnamurthy, M., Emadi, A.: An FPGA-
based novel digital PWM control scheme for BLDC motor drives. IEEE Trans.
Ind. Electron. 56(8), 3040–3049 (2009)
35. Lin, C.T., Hung, C.W., Liu, C.W.: Position sensorless control for four-switch three-
phase brushless DC motor drives. IEEE Trans. Power Electron. 23(1), 438–444
(2008)
36. Thomas, N., Felder, A., Bobda, C.: Adaptive controller using runtime partial hard-
ware reconfiguration for unmanned aerial vehicles (UAVs). In: International Con-
ference on ReConFigurable Computing and FPGAs (ReConFig), pp. 1–7 (2015)
37. Gohl, P., Honegger, D., Omari, S., Achtelik, M., Pollefeys, M., Siegwart, R.: Omni-
directional visual obstacle detection using embedded FPGA. In: IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), pp. 3938–3943 (2015)
38. Tang, J.W., Shaikh-Husin, N., Sheikh, U.U., Marsono, M.N.: FPGA-based real-
time moving target detection system for unmanned aerial vehicle application. Int.
J. Reconfig. Comput. (2016)
39. Yasukawa, S., Okuno, H., Ishii, K., Yagi, T.: Real-time object tracking based on
scale-invariant features employing bio-inspired hardware. Neural Netw. 81, 29–38
(2016)
40. Moeys, D.P., Delbrück, T., Rios-Navarro, A., Linares-Barranco, A.: Retinal gan-
glion cell software and FPGA model implementation for object detection and
tracking. In: IEEE International Symposium on Circuits and Systems (ISCAS), pp.
1434–1437 (2016)
41. Shimahara, S., Ladig, R., Suphachart, L., Hirai, S., Shimonomura, K.: Aerial
manipulation for the workspace above the airframe. In: IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), pp. 1453–1458 (2015)
42. Giitsidis, T., Karakasis, E.G., Gasteratos, A., Sirakoulis, G.C.: Human and fire
detection from high altitude UAV images. In: 23rd Euromicro International Confer-
ence on Parallel, Distributed, and Network-Based Processing, pp. 309–315 (2015)
43. Lou, Y., Clark, D., Marks, P., Muellerschoen, R.J., Wang, C.C.: Onboard radar
processor development for rapid response to natural hazards. J. Sel. Top. Appl.
Earth Obs. Remote Sens. 9, 2770–2776 (2016)
44. Panda, A.R., Mishra, D., Ratha, H.K.: FPGA implementation of software defined
radio-based flight termination system. IEEE Trans. Ind. Inform. 11, 74–82 (2015)
45. Blümm, C., Heller, C., Weigel, R.: SDR OFDM waveform design for a UGV/UAV
communication scenario. J. Signal Process. Syst. 69, 11–21 (2012)
46. Mikó, G., Németh, A.: SCFDM based communication system for UAV applications.
In: 25th International Conference on Radioelektronika, pp. 222–224 (2015)
Genomic Data Clustering on FPGAs for
Compression
1 Introduction
With the advent of high throughput sequencing, genomics has entered a new
era where massive amounts of data are produced (∼2–40 ExaBytes/year are to
be expected in 2025 [9]). The sequencing of one human genome generates in the
order of 300 GB of raw data. This data is composed of small sequences randomly
located in the genome, with high redundancy (typically 30–50×). Processing
data in a timely fashion is critically important for the future of genomics.
Another issue is the storage space required. Currently, many different data for-
mats are used and most of them are far from optimal (cf. [2,4,7]). Each format
has different characteristics, and so a universal standard is required to facilitate
the development of algorithms. This would for instance allow sharing the same
input/output logic. The authors are currently working on such a new format,
and to that end, clustering, as presented in this paper, can improve and speed up
genomic data compression, especially for sequences that do not share similarities
with the reference human genome.
To better understand the context of genomic algorithms, and particularly the
compression of such data, the next subsection describes the specificities of the data.
The work presented in this paper aims at reducing the processing time of
clustering, since software solutions suffer from the algorithm's high complexity,
O(n²), and the gigantic size of the data sets. Our approach, exploiting FPGAs,
dramatically reduces the processing time needed.
The rest of this paper is organized as follows: The next section introduces
the concept of clustering applied to genomic data. Section 3 presents the current
implementation focusing on hardware. Then Sect. 4 shows results. Finally Sect. 5
lists conclusions and introduces future work.
2 Clustering
Clustering data is a well-known field of research, in which algorithms are usually
designed with the goal of finding a number k of clusters grouping data with respect
to a neighbourhood function [3,6]. Clustering algorithms have been designed and
tailored for different domains including genomics [8], nevertheless no specific
clustering algorithm has been proven to be particularly useful for compression
of genomic data. Compression will benefit more from algorithms that define
clusters with highly correlated data, rather than having an exact number k of
clusters. Therefore, instead of doing k-clustering, such as k-means, k-medians
or k-medoids algorithms, it would be better to seek clusters, regardless of the
total number of clusters, using a small threshold neighbourhood function. The
following match function is used and shows how cluster membership is defined:
match returns true if both sequences should be in the same cluster and false
otherwise.
If N = 0, match becomes the = operator and will only return true if the two
sequences are identical. With N > 0, two sequences can match with a distance of
up to N, represented by a shift between them (ignoring the non-overlapping ends
of length d). Since DNA consists of a complementary double helix, matching
sequences with a reverse complement makes
5 values can be stored using 3 bits, and so the base-to-base matching function
can be implemented in a single six-input look-up table (LUT). This renders the
hardware implementation extremely efficient, as Altera and Xilinx FPGAs offer
such six-input LUTs as basic hardware building blocks.
To the best of our knowledge, FPGA implementations of k-clustering such as
[5,10] have been published, but nothing comparable to the proposed solution.
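Since the match function is described here only in prose, the following Python
sketch gives one plausible software reading of it: two sequences match if, for some
shift of at most N positions (ignoring the non-overlapping ends) and possibly after
reverse complementing, all overlapping bases agree. The 3-bit base encoding mirrors
the LUT argument above; the exact hardware formulation may differ.

    # Illustrative software model of the match criterion (not the RTL).
    CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3, 'N': 4}          # 5 symbols -> 3 bits
    COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C', 'N': 'N'}

    def reverse_complement(seq):
        return ''.join(COMPLEMENT[b] for b in reversed(seq))

    def match(a, b, n_shifts):
        """True if a and b agree for some shift d with |d| <= n_shifts,
        ignoring the d non-overlapping bases at the ends."""
        for cand in (b, reverse_complement(b)):
            for d in range(-n_shifts, n_shifts + 1):
                if d >= 0:
                    x, y = a[d:], cand[:len(a) - d]
                else:
                    x, y = a[:d], cand[-d:]
                if x and all(CODE[p] == CODE[q] for p, q in zip(x, y)):
                    return True
        return False

    assert match("ACGTACGT", "CGTACGTA", n_shifts=1)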
3 Design Implementation
This section will start with the software framework and its modular architecture.
Then the FPGA architecture implementing the clustering algorithm is detailed.
The reading stage gathers sequences from a FASTQ file. The processing stage
does the clustering of the sequences and the writing stage collects the results in an
output file. Each stage is realized by one or more threads and the communication
between the stages is done with one or more FIFO buffers (blocking queues).
This allows for a highly modular and performance-oriented setup.
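A minimal Python sketch of this three-stage, queue-based structure is shown below.
The reader/clusterer/writer functions and the assign callback are hypothetical
stand-ins for the native AVX2 and FPGA back-ends; only the thread-and-FIFO
organisation is the point.

    # Illustrative reader -> clustering -> writer pipeline with blocking queues.
    import threading, queue

    STOP = object()                       # sentinel marking the end of a stream

    def reader(path, out_q):
        with open(path) as f:
            for i, line in enumerate(f):
                if i % 4 == 1:            # FASTQ: the second of every 4 lines holds the bases
                    out_q.put(line.strip())
        out_q.put(STOP)

    def clusterer(in_q, out_q, assign):
        while True:
            seq = in_q.get()
            if seq is STOP:
                out_q.put(STOP)
                return
            out_q.put(assign(seq))        # assign() would call the CPU or FPGA unit

    def writer(in_q, path):
        with open(path, "w") as f:
            while True:
                item = in_q.get()
                if item is STOP:
                    return
                f.write(f"{item}\n")

    def run(fastq, result, assign):
        q1, q2 = queue.Queue(1024), queue.Queue(1024)
        stages = [threading.Thread(target=reader, args=(fastq, q1)),
                  threading.Thread(target=clusterer, args=(q1, q2, assign)),
                  threading.Thread(target=writer, args=(q2, result))]
        for t in stages: t.start()
        for t in stages: t.join()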
A version of the clustering unit has been implemented in software. It is opti-
mized using Intel’s AVX2 256-bit SIMD instructions. The software architecture
allows for easy deployment of multiple threads implementing this unit. Therefore
it can take full advantage of multi-core CPUs. The CPU and FPGA clustering
units are interchangeable allowing performance and result comparisons.
The typical setup for our FPGA board has six reader threads reading from
different portions of a FASTQ file, feeding six clustering units, each one on a
separate FPGA, and threads collecting the results as they come in.
The FPGA board is connected using a PCIe x16 slot on an Intel Haswell Core
i7-based PC running GNU/Linux. The communication between the software and
the FPGAs is done using the PCIe bus. The drivers and API are provided by
Micron Pico Computing (the target platform being an EX-700 backplane with
AC-510 modules) but the software can easily be adapted for other hardware.
This section details the hardware architecture, and specifically the internal data
flow. Moving data is critical because of the tremendous amount of sequences to
be processed. Figure 4 shows the top hierarchy of our implementation.
Fig. 4. Top-level architecture of the design: the PCIe stream interface and output
FIFOs, the clustering FSM with its matching unit (match/no match), and the cache
FSM with the memory controller.
In order to analyze the data stream flowing through the design, the latter
was divided into three parts: the interface, the core, and the cache. The first
part implements the communication with the outside world, in which the PCIe
interface receives the reads coming from the CPU and sends them to the cluster-
ing unit. This part of the design is also in charge of storing the clustering results
and sending them back to the CPU. The central block contains the clustering
algorithm itself. It decides if the current sequence belongs to the present clusters
or not. In case of a positive match the ID and the score (alignment information)
of the sequence are sent back to the output FIFOs, otherwise the sequence is
forwarded to the FPGA cache. The latter is mainly composed of the memory
controller, which will store all the unmatched sequences from the clustering unit
into the external memory. During a run, data flows through two different paths
and the algorithm can be separated into two different phases.
Fig. 5. In the first phase the data is coming from the outside, while in the second
phase it is coming from the cache.

Phase One. At the very beginning, the system receives sequences from the PCIe
interface. The latter directly forwards this information to the core block. The
clustering unit then starts. It flags the first non-matching sequences as references
and then compares the following sequences to these references. When a sequence
matches with a reference, the result is sent to the output FIFOs, and if it does
not match any cluster the sequence is stored in the cache memory (Fig. 5). It has
to be noted that the size of the memory used to store the unmatched sequences
will directly limit the maximum number of sequences that a single FPGA can
handle. In the worst case, in which none of the sequences match the references of
the current clusters, the FPGA has to be able to store all the sequences into its
cache. This issue can be resolved by adding more external memory to the FPGA
subsystem or by cutting the FASTQ file into smaller pieces whose sequences can
fit into the cache memory. This can be done sequentially with a single FPGA
or in parallel with a group of FPGAs. It is worth noting that this process
could lead to a sub-optimal clustering result but would allow handling files of
any size even on hardware settings with limited memory capacity.
Phase Two. When all the sequences in the FASTQ file have been sent to the
FPGA and were tested against all current clusters, the FPGA starts to work on
the sequences stored in the cache. These sequences are sent to the core module
once again. The latter will use the first non-matching sequences as new references
for the clusters and restart the clustering process. Again, if a sequence matches a
reference, its ID and score are sent to the output FIFOs, otherwise the sequence
will return to the cache in order to be processed during the next run of phase
two (Fig. 5). Phase two is repeated until all the sequences have been placed into
a cluster, leaving the cache empty.
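The two phases can be summarised by the following Python sketch, which assumes a
simple in-memory list as the "cache" and a fixed number of reference slots per pass
(e.g. one per matching unit, cf. Table 1); the hardware instead streams sequences
through the matching units and spills unmatched ones to the HMC.

    # Software model of the two-phase clustering flow (illustrative only).
    def cluster(sequences, match):
        clusters = []                 # list of (reference, [member ids])

        def one_pass(stream):
            references = []           # references opened during this pass
            spill = []
            for sid, seq in stream:
                for ref, members in references:
                    if match(ref, seq):
                        members.append(sid)
                        break
                else:                 # no match: open a new cluster or spill
                    if len(references) < MAX_REFERENCES:
                        references.append((seq, [sid]))
                    else:
                        spill.append((sid, seq))
            clusters.extend(references)
            return spill

        MAX_REFERENCES = 70           # e.g. the number of matching units (Table 1)
        cache = one_pass(enumerate(sequences))      # phase one: data from outside
        while cache:                                # phase two: re-run over the cache
            cache = one_pass(cache)
        return clusters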
While phase one is executed only once, phase two is repeated a number of times.
This high number of executions, combined with the memory latency, slows down the
clustering process. Two stratagems have been used to minimize the slowdown.
Firstly, using the fastest memory elements on the market (i.e. HMC modules4)
helps to decrease the store/load time of each sequence5. Secondly, using as many
parallel clustering units as possible decreases the number of executions of phase
two, since there will be fewer sequences for each run.
4 http://hybridmemorycube.org/files/SiteDownloads/HMC Specification 1 0.pdf.
5 The memory is used as a circular buffer and only written to or read from in bursts
to maximize the read/write speeds.
(Figure: internal data flow between the clustering FSM, the cache FSM and the memory
controller, with the sequences broadcast to the matching units and the results
collected in the output FIFOs.)
Table 1. Resource usage for a design counting 70 matching units with ±16 shifts and
reverse complement matching capability @ 125 MHz on a Kintex Ultrascale 060
This section summarizes the performance results and shows the speed gain
achieved thanks to hardware acceleration on FPGA. Three hardware designs and
three software versions were used. Each FPGA-based version runs at 125 MHz.
The software versions run on an Intel Core i7-4790 Haswell 4-core hyper-threaded
processor running at 4 GHz. All versions have reverse complement matching.
The versions differ in the maximum number of shifts tolerated by the match
function. This maximum number has a direct impact on the clustering unit's
complexity: the lower the maximum number of shifts, the smaller the hardware
unit and the faster the software match function runs. This allows for
hardware implementations with more clustering units in the same FPGA. It is
also to be noted that by reducing this number a given sequence is less likely to
be a member of an existing cluster. This will lower the average size of the final
clusters and make the algorithm go through more passes of phase two. However
as can be seen in Fig. 7, having more clustering units outweighs the fact that
the sequences are less likely to fit a cluster. This gives us a choice between
larger final clusters and faster processing. Figure 8 shows the number of clusters
relative to their size (number of sequences in the cluster).
The sequences used during the experiments are unmapped paired sequences
of 126 bases. They were generated using an Illumina sequencer on a real human
sample.
The timing measurements shown in Fig. 7 were done using a single FPGA
module and multi-threaded software using a single thread for the clustering
algorithm. Both versions can benefit from more parallelism, having the hardware
run on six modules and using more threads for the CPU version. Running the
algorithm with six modules allowed us to process six times more data with the
same timing results. This was possible because there is no dependency and no
I/O bottleneck since the unmatched sequences reside in the HMC local to the
modules. Having six threads for the clustering units on our 4-core CPU (8 logical
cores thanks to hyper-threading) processed six times more sequences but also
ran slower due to resource sharing (memory) resulting in a speedup of only 3.53
compared to the speedup of 6 on the multi-FPGA version.
Fig. 8. Number of clusters (logarithmic scale) as a function of cluster size (number
of sequences in the cluster), for the designs with 4, 8 and 16 shifts.

In order to do clustering on a real case, around 100 million sequences would
have to be processed (a ∼24 GB FASTQ file). To do this, the file would be split
between six FPGA modules or six CPU cores, limiting the amount of time needed
to process it. Table 2 shows the time needed to achieve this. The values were
measured or extrapolated9 from real timing measurements using a FASTQ file
of unaligned sequences (with respect to the reference genome).
All in all, the software versions take so much time that they rapidly become
unusable, whereas the results using FPGA acceleration remain reasonable.
Table 2. Proc. time needed to cluster a real case file of ∼ 100×106 unaligned sequences
The effective speed gain between software running on a CPU and FPGA-based
accelerators is colossal. It takes more than 1000 times longer in software, an
impressive 2.6 years of CPU processing, to cluster the unaligned sequences of a
single person's genome. This becomes much more acceptable using FPGA-based
acceleration, which requires only around half a day of processing time. It is to
be noted that while the FPGAs are processing the sequences the CPU is almost idle;
its only task is to collect results and write them to a file. The CPU's spare
processing power could be used, e.g., for joining cluster results from different
FPGAs.
In terms of power consumption, running the software on the same PC draws
around 100 W of power (without FPGA card installed). Running the software
with FPGA accelerators, the PC draws around 220 W of power. In terms of
energy needed for the task, the CPU version needs 100 W for 2.6 years and the
FPGA-based version 220 W for under a day, which is almost 700 times less energy.
The goal of this paper was to introduce a clustering framework based on FPGA
acceleration, with the idea of providing clusters of sequences to ease genomic data
compression. A solution for clustering unaligned genomic sequences has been
found and verified. This solution already offers a massive speed gain over
processor-based implementations (×1000) as well as significant energy savings
(×700). The framework is modular enough to be easily modified and further
developed, which allows exploring new (compression) algorithms using clustering
as well as researching new algorithms for general genomic data processing.
9 The values following the ≈ sign in Table 2 are extrapolated.
To improve on and go further with this work, several paths could be taken.
Having a PCIe switch on the FPGA board makes it possible for the modules to
communicate using PCIe x8 links without interfering with the PC; communication
between units could allow for better clustering results. Using heterogeneous
accelerators in this setup would grant even more possibilities. Future work should
also include quantifying compression rates relative to the clustering algorithm in
order to determine the best implementation (in terms of number of shifts and
clusters).
This work provides a solid basis to further expand research in the field of
genomic data processing, and it has shown that it is possible to run algorithms
of high complexity, such as O(n²), on big datasets in reasonable time.
Acknowledgments. The research presented in this paper was funded by the Swiss
PASC initiative in the framework of the PoSeNoGap (Portable Scalable Concurrency
for Genomic Data Processing) project. The authors would like to thank all the par-
ticipants for the fruitful discussions, namely Ioannis Xenarios, Thierry Schüpbach and
Daniel Zerzion from SIB, Marco Mattavelli and Claudio Alberti from EPFL.
References
1. Cox, A.J., Bauer, M.J., Jakobi, T., Rosone, G.: Large-scale compression of genomic
sequence databases with the Burrows-Wheeler transform. Bioinformatics 28(11),
1415–1419 (2012)
2. Deorowicz, S., Grabowski, S.: Compression of DNA sequence reads in FASTQ
format. Bioinformatics 27(6), 860–862 (2011)
3. Du, K.L.: Clustering: a neural network approach. Neural Netw. 23(1), 89–107
(2010)
4. Fritz, M.H.Y., Leinonen, R., Cochrane, G., Birney, E.: Efficient storage of high
throughput DNA sequencing data using reference-based compression. Genome Res.
21(5), 734–740 (2011)
5. Hussain, H.M., Benkrid, K., Seker, H., Erdogan, A.T.: FPGA implementation of
k-means algorithm for bioinformatics application: an accelerated approach to clus-
tering microarray data. In: 2011 NASA/ESA Conference on Adaptive Hardware
and Systems (AHS), pp. 248–255, June 2011
6. Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern Recogn. Lett. 31(8),
651–666 (2010). Award winning papers from the 19th International Conference on
Pattern Recognition (ICPR)
7. Pinho, A.J., Pratas, D., Garcia, S.P.: Green: a tool for efficient compression of
genome resequencing data. Nucleic Acids Res. 40(4), e27 (2011)
8. Pollard, K.S., van der Laan, M.J.: Bioinformatics and computational biology solu-
tions using R and bioconductor. In: Gentleman, R., Carey, V.J., Huber, W.,
Irizarry, R.A., Dudoit, S. (eds.) Cluster Analysis of Genomic Data, pp. 209–228.
Springer, New York (2005)
9. Stephens, Z.D., Lee, S.Y., Faghri, F., Campbell, R.H., Zhai, C., Efron, M.J., Iyer,
R., Schatz, M.C., Sinha, S., Robinson, G.E.: Big data: astronomical or genomical?
Plos Biol. 13(7), e1002195 (2015)
10. Winterstein, F., Bayliss, S., Constantinides, G.A.: FPGA-based k-means cluster-
ing using tree-based data structures. In: 23rd International Conference on Field
programmable Logic and Applications. pp. 1–6, September 2013
A Quantitative Analysis of the Memory
Architecture of FPGA-SoCs
1 Introduction
HW/SW-codesign is a common approach applied in domains where neither pure
hardware nor pure software implementations offer a satisfying solution. It com-
bines the advantages of both hardware and software and therefore delivers an
elaborated solution to a given problem. FPGA manufacturers such as Xilinx and
Intel are offering devices, often called FPGA-SoCs, that combine an FPGA logic
fabric and a dedicated processor, which in the end allows for a significant perfor-
mance gain when using HW/SW-codesign compared to pure software solutions.
In order to achieve high speedup, it is clearly important to achieve high
performance of both the hardware and the software. However, without having
sufficient memory bandwidth it is not possible to unleash the full potential of
such a solution. In fact, the memory bandwidth often poses the bottleneck in
HW/SW-codesigns and therefore limits the overall performance: While it is pos-
sible to achieve a very high throughput in an FPGA, the memory interface is in
many cases not able to provide input and store output data fast enough [1,2].
Consequently, many research papers only present the throughput inside the
FPGA while disregarding the memory bandwidth [3,4]. For this reason, this
work presents an analysis of the memory architecture of FPGA-SoCs.
Two representative low-cost FPGA-SoCs have been chosen for the analysis,
particularly the Zynq-7020 from Xilinx and the Cyclone V SE SoC from Intel.
Furthermore, the same benchmarks have been performed on the Zynq-7045 from
Xilinx to show the memory bandwidth of a high-performance FPGA-SoC. These
results have also been compared to a system using a configurable soft-core mem-
ory controller from Xilinx. This allows for a comparison of the memory band-
width of FPGA-SoCs with soft-core SoCs using Xilinx’s Microblaze or Intel’s
Nios II. The best configurations for all these devices are discussed and their
respective strengths are highlighted.
The main contribution of this paper is the evaluation of the memory sub-
systems of the Zynq-7000 SoC from Xilinx and the Cyclone V SoC from Intel,
taking into account all of the following:
2 Related Work
Some other work already evaluated the memory bandwidth of FPGA-SoCs. First
results are given by Sadri et al. [5]. They analyzed the memory interfaces of the
Zynq-7020 with a focus on the Accelerator Coherency Port (ACP), which allows
coherent access from IP cores implemented in logic to main memory. The results
show that it is possible to achieve a full-duplex throughput of up to 1.7 GB/s
when using a single port between memory and programmable logic, with the IP
core running at a fixed frequency of 125 MHz.
Sklyarov et al. [6] also evaluated the Zynq-7020. Although the maximum
bandwidth at the chosen frequency of 100 MHz is not given explicitly, it can be
derived from the results that the achieved maximum bandwidth is significantly
lower than the theoretical maximum (e.g. 284 MB/s for a 64-bit port when read-
ing and writing 32 KB instead of the theoretically possible 800 MB/s).
Furthermore, Tahghighi et al. [7] present a mathematical model that allows
to estimate the latency of a memory access from the programmable logic. While
the model covers several parameters, it is currently limited to the Zynq-7000. It
also does not give an overview of the available memory bandwidth for different
A Quantitative Analysis of the Memory Architecture of FPGA-SoCs 243
access patterns. Similar to [5], it does not cover the combination of multiple
ports to increase the overall memory bandwidth.
Although these papers provide valuable information, several of our questions
remain unanswered. For instance, the combination of multiple ports yields a
significant increase in bandwidth, thus expanding the range of applications
suitable for FPGA-SoCs. While this is analyzed in [6], their results are
surprisingly low; in comparison, our results show a significantly higher bandwidth
when combining multiple ports. Furthermore, to the best of our knowledge, our
work is the first to include multiple devices that cover a large part of the market
(Xilinx's Zynq-7020 and Zynq-7045, Intel's Cyclone V SE SoC, and Xilinx's
Microblaze), while all the related papers only use the Zynq-7020 for their
evaluations, thus limiting their impact.
3 FPGA-SoCs
FPGA-SoCs are devices that contain a dedicated hard-core processor with var-
ious peripherals and programmable logic. Both components are located on the
same chip, which allows them to be tightly coupled. Such devices are offered by
Xilinx [8] and Intel [9]. Both combine a 32-bit dual-core ARM Cortex-A9 based
CPU with programmable logic. This CPU uses the ARMv7-A architecture and
supports NEON SIMD instructions. A two-level cache hierarchy is available that
provides 32 KB of L1 cache per core and a shared 512 KB L2 cache.
Xilinx offers the Zynq-7000 family of so-called All-programmable SoCs while
Intel offers SoCs as part of their Cyclone, Stratix and Arria product lines to
cover the whole market. Both vendors have already announced successors to
their current FPGA-SoCs, featuring a 64-bit quad-core ARM Cortex-A53 CPU
and more logic resources. However, as they were not publicly available at the
time of this work, they could not be included.
While Xilinx devices only support the ARM AXI standard, Intel supports
AXI as well as their own Avalon standard. For the sake of comparison, only the
AXI mode of Intel’s devices was taken into account. Both vendors offer a variety
of master and slave ports suitable for different applications. As the master ports
(i.e. the CPU is the master) cannot be used directly to access the DDR memory
from the programmable logic, these ports will not be discussed in this work.
Xilinx’s Zynq-7000 devices offer the following ports for the programmable
logic to access memory:
1. General-purpose (GP) ports
These two ports have a fixed width of 32 bits and no internal buffers, making
them a good choice for low-throughput applications.
2. High-performance (HP) ports
Four slave ports with widths of either 32 or 64 bits with built-in FIFOs are
available for high-throughput applications.
3. Accelerator Coherency Port (ACP)
This additional 64-bit port resembles the HP ports. However, the ACP allows
cache-coherent access to the memory.
Fig. 1. The engines that are used to perform one- or two-dimensional accesses to main
memory. A register-based AXI-Lite interface for control tasks and a full-scaled AXI
master interface for data transfer connect the engine to the CPU and the main memory.
Note that the gray blocks are only required for the write engine.
For Intel’s SoCs, the layout of the ports for accessing memory is as follows:
In this section, the designs and implementations of the so-called memory engines
are presented briefly. These engines allow us to gain the required insights into the
potential bandwidth of the different ports. They are designed to support one-
and two-dimensional access to memory with a fixed stride, as well as trace-based
inputs, i.e. a list of specific memory transactions. As this work focuses on high-
throughput applications, the GP ports of the Zynq-7000 and the F2H port of
Intel’s SoC are not evaluated.
Figure 1 shows the general structure of the Write Engine that is used to
determine the achievable write bandwidth for different scenarios. It has two
different AXI interfaces: a full-scale AXI master interface for the actual memory
access connected to one of the ports mentioned in Sect. 3 and a register-based
AXI-Lite interface for control and configuration purposes. The latter is connected
to the CPU using dedicated AXI ports that are not suitable for memory access.
While Xilinx and Intel offer IP cores supporting AXI4, their FPGA-SoCs only
support AXI3 for memory access. Therefore, the maximum number of data beats in
one burst is 16. By using the control interface, the specific scenario in terms of
height and width of the access as well as the stride for two-dimensional access,
i.e. the offset between two bytes in the same column, can be controlled.
The parameters stored in these registers are used by a Control Unit, which
splits the two-dimensional block into one-dimensional transactions if necessary.
These requests are afterwards converted into AXI transactions by an Address
Generator. This unit is connected to the address lines of the AXI interface
and drives the required signals. In addition, it deals with alignment issues. The
requests are buffered in a FIFO from which they are read by a Data Generator.
It writes the requested amount of data from a Pseudo-Random Binary Sequence
(PRBS) generator to main memory.
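To illustrate the kind of work done by the Control Unit and Address Generator, the
following Python sketch splits a two-dimensional block into one-dimensional AXI3
bursts of at most 16 data beats. The 64-bit beat size is an assumption, and alignment
and 4 KB-boundary handling are omitted for brevity.

    # Split a (width x height) block with a given stride into AXI3 bursts.
    BEAT_BYTES = 8            # assumed 64-bit data bus
    MAX_BEATS = 16            # AXI3 limits a burst to 16 data beats

    def bursts(base, width, height, stride):
        """Yield (address, beats) pairs covering the block row by row."""
        for row in range(height):
            addr = base + row * stride
            beats_left = (width + BEAT_BYTES - 1) // BEAT_BYTES   # beats per row
            while beats_left > 0:
                n = min(beats_left, MAX_BEATS)
                yield addr, n
                addr += n * BEAT_BYTES
                beats_left -= n

    # Example: 50 rows of 4096 bytes with a 1 MiB stride.
    reqs = list(bursts(base=0x1000_0000, width=4096, height=50, stride=1 << 20))
    assert all(n <= 16 for _, n in reqs)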
To accurately measure the throughput of each operation, a Monitor has been
added that measures the number of cycles the operation takes. It communicates
with the CPU by using the register interface.
The implemented Read Engine for reading data from main memory has a
very similar structure. However, as no data has to be generated and written for
reading data, the corresponding generator and the FIFO are not required in this
case.
While the first benchmark gives an overview of the bandwidth that can be
expected for a given width and height, the latter allows measuring the band-
width for a real-world scenario with a mix of different block sizes. In this section,
a comparison of the Zynq-7020 and the Cyclone V SoC will be discussed, as
these are two chips in the same price segment. Later, the same benchmarks will
be used to evaluate a high-performance FPGA-SoC, the Zynq-7045, in order to
show the difference between low-cost FPGA-SoCs and high-performance FPGA-
SoCs. Finally, a comparison to a system which uses Xilinx’s soft-core memory
controller instead of the hard-core memory controller of an FPGA-SoC will be
presented. This allows comparing the bandwidth of the memory controller of an
FPGA-SoC with that of a soft-core SoC such as Xilinx’s Microblaze or Intel’s
Nios-II running on an FPGA.
All the benchmarks used in this work are optimized for high bandwidth. As
a result, the highest possible number of data beats per burst is used.
Cyclone V SoC and Zynq-7020. The experiments in this part have been
performed using the DE1-SoC Board from Terasic that features Intel’s Cyclone
V SoC and the Zedboard from Digilent with Xilinx’s Zynq-7020. The bandwidth
is given in MiB/s, i.e. 2^20 bytes/s, and not in 10^6 bytes/s.
In order to get an overview of the achievable throughput for accessing differ-
ent patterns in main memory, a synthetic benchmark has been used. It takes the
width and height of the block being processed as well as the stride as parameters.
The analyzed configurations include cached and non-cached software implemen-
tations as well as hardware implementations with different numbers of HP ports
(Xilinx) or different widths of the F2S port (Intel) and with the ACP.
To have a reasonable baseline, the software implementations are NEON-
accelerated, i.e. they use SIMD memory instructions to maximize the through-
put. The non-ACP hardware implementations have been performed using a fixed
frequency of 110 MHz for both the memory engine and the AXI bus, while the
ACP implementation uses a frequency of 100 MHz. These are the maximum fre-
quencies, i.e. the highest frequencies for which the memory engines could be
placed and routed on all devices. The CPU on the Intel device is running at 800
MHz and also uses 800 MT/s for the memory controller. Xilinx uses a CPU with
a frequency of 666 MHz, but 1066 MT/s to access the DDR memory. Due to
the different memory data rates, the theoretical maximum bandwidth for DDR
memory access is higher for the Zynq-7020 (4066 MiB/s) than for the Cyclone
V SoC (3052 MiB/s). For all hardware experiments, the memory controller has
been configured to prioritize the programmable logic memory ports and therefore
minimize the impact of parallel memory accesses from software.
Figure 2(a)-(f) shows the results for the software and the non-ACP hardware
scenarios. In this figure, a fixed stride of 1 MiB and a fixed height of 50 rows
have been used while the width in bytes is the variable parameter with a range
from 1 byte to 1 MiB. The choice of a height of 50 rows has been made as heights
in this range are found quite often in video coding applications, an important
domain when analyzing two-dimensional memory accesses. An example is the
block structure of HEVC/H.265 [10]. A fixed stride of 1 MiB has been used
as the stride must be larger than or equal to the width. Thus, this choice allows for
evaluating different memory accesses with a width of up to 1 MiB while using the
same stride. Due to the choices of height and stride, this can either be interpreted
as a single two-dimensional access with a height of 50 and a stride of 1 MiB or
as 50 one-dimensional accesses with a fixed distance of 1 MiB between them.
Therefore, it provides information for one- as well as two-dimensional access.
For reading, the non-cached SW baseline has the lowest throughput for both
devices with a maximum bandwidth of 256 MiB/s on the Zynq-7020 and 150
MiB/s on the Cyclone V SoC. On the other hand, for the cached SW baseline, the
Intel device has a significantly higher bandwidth of up to 996 MiB/s compared
to a maximum of 751 MiB/s for its Xilinx counterpart. These differences are
probably caused by the lower frequency of the Xilinx CPU and therefore of the
caches. However, starting at around 16 KiB, i.e. the width where the 512 KiB L2
cache can no longer hold the entire 50 rows, the Zynq-7020 again outperforms its
counterpart. The stride of 1 MiB induces several cache misses in this case, which
allows for comparing it to the other non-cached accesses in this benchmark.

Fig. 2. The bandwidth (BW) for a fixed stride of 1 MiB and a height of 50 rows. The
HW implementations are running at 110 MHz (Zynq-7020 and Cyclone V SoC) and
214 MHz (Zynq-7045 4x/2x) or 250 MHz (Zynq-7045 1x). The CPUs are running at
666 MHz (Zynq-7020) or 800 MHz (Zynq-7045 and Cyclone V SoC). Note that for the
combined read and write transactions the added bandwidth for reading and writing is
given.
For the 64-bit HW implementation, both devices are limited by the low AXI
bus frequency of 110 MHz resulting in a bandwidth of 839 MiB/s. By using all
four HP ports or a 256-bit F2S port, higher bandwidths of up to 3337 MiB/s for
the Zynq-7020 and up to 2590 MiB/s for the Cyclone V SoC can be achieved.
The difference is caused by the higher memory data rate for the Zynq-7020 of
1066 MT/s. It can also be seen that the 256-bit F2S port of the Cyclone V SoC
requires a higher block width to reach its maximum bandwidth. Both devices
behave similarly when using two 64-bit ports in parallel, reaching a maximum
of 1644 MiB/s (Cyclone V SoC) and 1689 MiB/s (Zynq-7020), respectively. In
particular, for small block widths it turns out to be more reasonable to use two
64-bit ports than one 256-bit port.
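These limits follow directly from the interface parameters, as the following
back-of-the-envelope Python sketch shows; the 32-bit DDR data bus width is an
assumption based on the evaluation boards used.

    MIB = 2**20

    def ddr_peak(mt_per_s, bus_bytes=4):          # 32-bit DDR interface assumed
        return mt_per_s * 1e6 * bus_bytes / MIB    # MiB/s

    def axi_port_peak(freq_hz, port_bytes):
        return freq_hz * port_bytes / MIB          # MiB/s, one beat per cycle

    print(round(ddr_peak(1066)))          # ~4066 MiB/s (Zynq-7020)
    print(round(ddr_peak(800)))           # ~3052 MiB/s (Cyclone V SoC)
    print(round(axi_port_peak(110e6, 8))) # ~839 MiB/s (64-bit port at 110 MHz)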
Figure 2 also shows the writing results for the same settings, again for the
software and non-ACP hardware scenarios. The main difference is the improved
cached SW baseline for both devices. For the Cyclone V SoC it is even compara-
ble to the 256-bit HW implementation. In general, for the HW implementations,
the same behavior as for reading can be seen: The 64-bit implementation is
limited by the AXI interconnect frequency, while the 256-bit solution of Xilinx
outperforms the Intel one.
The plots (e) and (f) in Fig. 2 show the result of reading and writing in
parallel. As the read and write signals of an AXI interface are independent
of each other, both operations can be performed simultaneously. This has
been accomplished by instantiating a read and a write engine in parallel. For
the 64-bit and the 2 × 64-bit HW implementations, the bandwidth has increased
significantly. This is caused by the increase of the bus width: As two independent
data busses are used for reading and writing, the effective bus width is doubled.
While the former experiments deal mostly with non-coherent accesses,
Fig. 3(a)-(d) compares reading from main memory using the ACP in coherent
mode running at 100 MHz to the NEON-accelerated SW baseline. The chosen
scenario uses a stride of 1 MiB and a height of 5, 10, 20 or 100 rows. The dif-
ferent heights are required to analyze the impact of the cache architecture on
the bandwidth. To see the full impact of caching, the same operation has been
performed 100 times before starting the actual measurements as this reduces the
number of cold cache misses.
For the SW baseline it can be seen that caching is especially useful for small
heights. For a height of 5 rows and a fixed width of 4096 bytes, bandwidths of
3839 MiB/s and 5441 MiB/s can be seen on the two devices, respectively. On the
other hand, for larger heights some rows are evicted from the cache due to conflict
misses,
which results in a higher miss rate. In fact, for small widths it is even possible
on the Cyclone V SoC to achieve bandwidths higher than the maximum DDR
bandwidth of 3052 MiB/s.
Fig. 3. The cached read bandwidth (BW) for a fixed stride of 1 MiB. Note the different
scale for the Zynq-7045. For all scenarios, the same transactions have been performed
100 times before starting the measurement in order to fill the caches and therefore
maximize the throughput.

For the ACP, the bandwidth is significantly lower compared to the SW baseline.
The data bus width of 8 bytes and the employed frequency of 100 MHz limit the
bandwidth to 763 MiB/s. In fact, for widths smaller than 256 bytes,
a higher bandwidth can simply be reached by performing non-coherent accesses
on the ACP. The ACP bandwidth for a width of 32 bytes is anomalously high.
As this behavior occurs on both devices, it indicates a general limitation of the
ACP port.
(Bar chart: read bandwidth in MiB/s, up to about 1,500 MiB/s, for the Cyclone V SoC,
Zynq-7020 and Zynq-7045, comparing SW, coherent ACP, non-coherent ACP, and HP
Single/Dual/Quad or 64/128/256-bit F2S configurations.)
Fig. 4. The achievable read bandwidth (BW) for a trace-based simulation of the mem-
ory accesses of the motion compensation stage of an H.265/HEVC decoder. A Full
HD video stream with a medium bitrate has been used.
100 MHz for the other two SoCs. Again, these frequencies are the maximum
achievable on each device for this implementation.
It can be seen that both the SW baseline and the coherent ACP imple-
mentation offer a very low bandwidth of less than 200 MiB/s. In comparison,
non-coherent HW solutions offer a significantly higher throughput. While the
bandwidth does not scale perfectly with the number of ports (Zynq) or the port
width (Cyclone V), it can still be increased significantly this way. As the
difference for 256 bits between the 100 MHz solution on the low-cost FPGA-SoCs
and the 214 MHz solution on the Zynq-7045 is rather small, the bottleneck
is apparently not located in the AXI bus, but instead in the memory controller
itself. For the HP Quad solution on the Zynq-7045, a bandwidth of 1515 MiB/s
can be reached, which is sufficient for real-time Full HD decoding [12].
The theoretical maximum of 4066 MiB/s on the Zynq cannot be reached,
however. This can be explained with the different block sizes: As can be seen
in Fig. 2(g), the expected bandwidth when using four HP ports is below 1000
MiB/s for those blocks with the smallest width (5 bytes) in this workload. On
the other hand, a bandwidth of almost 4000 MiB/s can be reached for those
blocks with the largest width (142 bytes). As a result, the actual bandwidth is
in between these two extremes. An analysis of the block sizes for the workload
shows that almost 50% of the blocks have a width smaller than 16 bytes and
more than 80% of the blocks have a width smaller than 32 bytes. Therefore, the
small memory accesses dominate which results in a relatively low bandwidth.
6 Conclusions
In this paper, three different FPGA-SoCs from Xilinx and Intel have been evalu-
ated regarding their memory bandwidth. In particular, two low-cost devices, the
Zynq-7020 from Xilinx and the Cyclone V SoC from Intel, have been com-
pared. The Zynq-7045 from Xilinx has been evaluated as an example for a
high-performance FPGA-SoC. By using several synthetic benchmarks, it has
been possible to determine the memory bandwidth for various scenarios. A real
workload from the field of video coding has been applied as well. Finally, the
References
1. Fu, H., Clapp, R.: Eliminating the memory bottleneck: an FPGA-based solution
for 3D reverse time migration. In: 19th ACM/SIGDA International Symposium on
Field Programmable Gate Arrays (FPGA), Monterey, USA (2011)
2. Naylor, M., Fox, P., Markettos, A., Moore, S.: Managing the FPGA memory wall:
custom computing or vector processing? In: 23rd International Conference on Field
Programmable Logic and Applications (FPL), Porto, Portugal (2013)
3. Dobai, R., Sekanina, L.: Image filter evolution on the Xilinx Zynq platform. In:
NASA/ESA Conference on Adaptive Hardware and Systems (AHS), Turin, Italy
(2013)
4. Ishikawa, S., Tanaka, A., Miyazaki, T.: Hardware accelerator for BLAST. In:
6th IEEE International Symposium on Embedded Multicore SoCs (MCSoC),
Aizu-Wakamatsu, Japan (2012)
5. Sadri, M., Weis, C., Wehn, N., Benini, L.: Energy and performance exploration of
accelerator coherency port using Xilinx Zynq. In: 10th ACM FPGAworld Confer-
ence, Copenhagen, Denmark, and Stockholm, Sweden (2013)
6. Sklyarov, V., Skliarova, I., Silva, J., Sudnitson, A.: Analysis and comparison of
attainable hardware acceleration in all programmable systems-on-chip. In: 2015
Euromicro Conference on Digital System Design (DSD), Funchal, Portugal (2015)
7. Tahghighi, M., Sinha, S., Zhang, W.: Analytical delay model for CPU-FPGA data
paths in programmable system-on-chip FPGA. In: 12th International Symposium
on Applied Reconfigurable Computing (ARC), Mangaratiba, Brazil (2016)
8. Zynq-7000 All Programmable SoC Technical Reference Manual by Xilinx.
http://www.xilinx.com/support/documentation/user_guides/ug585-Zynq-7000-TRM.pdf
9. Altera’s User-Customizable ARM-based SoCs by Altera. http://www.altera.com/
literature/br/br-soc-fpga.pdf
10. Sullivan, G., Ohm, J.-R., Han, W.-J., Wiegand, T.: Overview of the High Effi-
ciency Video Coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol.
22(12), 1649–1668 (2012)
11. 7 Series FPGAs Memory Interface Solutions User Guide by Xilinx
12. Chi, C.C., Alvarez-Mesa, M., Bross, B., Juurlink, B., Schierl, T.: SIMD acceleration
for HEVC decoding. IEEE Trans. Circuits Syst. Video Technol. 25, 841–855 (2014)
Neural Networks
Optimizing CNN-Based Object Detection
Algorithms on Embedded FPGA Platforms
Ruizhe Zhao1(B) , Xinyu Niu1 , Yajie Wu2 , Wayne Luk1 , and Qiang Liu3
1 Imperial College London, London, UK
{ruizhe.zhao15,niu.xinyu10,w.luk}@imperial.ac.uk
2 Corerain Technology, Shanghai, China
james.wu@corerain.com
3 Tianjin University, Tianjin, China
qiangliu@tju.edu.cn
1 Introduction
(e.g. different convolution layers can have different kernel sizes, such as 3 × 3,
7 × 7 and 11 × 11), which increases the difficulty of designing generic hard-
ware modules that can be adapted to varying parameters. (2) Object detection
algorithms use deep and complex CNN architectures, which makes it hard to
fit the network into an FPGA and to decide the optimal parameters of hard-
ware modules. (3) Multiple backbone CNN architectures are available to an
object detection algorithm, and the higher the accuracy an architecture can
achieve, the more hardware resources it will require.
Our main contribution in this paper is a CNN accelerator design customised
for object detection algorithms on an embedded FPGA platform. This design
can tackle those three aforementioned challenges: (1) This design is built upon
parameterised hardware modules that can be configured for different layer para-
meters. (2) We develop design models for estimating resource usage of deep CNN
architectures. (3) We present an optimisation flow that treats two CNN-based
object detection algorithms (YOLO and Faster RCNN) and their backbone CNN
architectures as candidates, in order to find the optimal hardware design under
different optimisation targets (e.g. speed or accuracy). At the end of this paper,
we provide evaluation results for both the design model accuracy and the per-
formance of the optimal hardware design. To the best of our knowledge, this is
the first work to support end-to-end development of CNN-based object detection
applications with FPGA accelerators.
O_f = \sum_{c=1}^{C} conv(I_c, K_{f,c}) + b_f    (1)
This equation means that each output filter will sum up all convolution results
between each channel of the input feature map (I_c) and the kernel (K_{f,c}).
In many architectures, an activation function can be applied to the result
elements, like Rectified Linear Unit (ReLU).
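For clarity, Eq. 1 corresponds to the following plain-Python sketch of a single
output filter (stride 1, no padding, optional ReLU). The loop structure, not
performance, is the point.

    # Direct implementation of Eq. 1 for one output filter (illustrative).
    def conv2d(channel, kernel):
        k = len(kernel)
        h, w = len(channel), len(channel[0])
        out = [[0.0] * (w - k + 1) for _ in range(h - k + 1)]
        for y in range(h - k + 1):
            for x in range(w - k + 1):
                out[y][x] = sum(channel[y + i][x + j] * kernel[i][j]
                                for i in range(k) for j in range(k))
        return out

    def output_filter(inputs, kernels, bias, relu=True):
        """O_f = sum_c conv(I_c, K_{f,c}) + b_f, optionally followed by ReLU."""
        acc = None
        for channel, kernel in zip(inputs, kernels):     # sum over channels c
            partial = conv2d(channel, kernel)
            if acc is None:
                acc = partial
            else:
                acc = [[a + b for a, b in zip(ra, rb)]
                       for ra, rb in zip(acc, partial)]
        return [[max(v + bias, 0.0) if relu else v + bias for v in row]
                for row in acc]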
Popular CNN Architectures. There are many CNN architectures, but only a
few of them have been validated on well-known datasets, and they are viewed as
state-of-the-art CNN architectures. The following are some CNN architectures
used in object detection algorithms. (1) VGG16 [11] is one of the VGGNet ver-
sions with 16 convolution layers and 2 pooling layers. An appealing feature of
VGGNet is that it has homogeneous kernel size (3 × 3) for all convolution lay-
ers, and is easy to implement on hardware accelerators. (2) Zeiler-and-Fergus
(ZFNet) [15] is the winner of the ImageNet Large Scale Visual Recognition Chal-
lenge (ILSVRC) 2013. It is shallower than VGGNet and has different kernel
sizes for different convolution layers. (3) GoogLeNet [14] is the winner of
ILSVRC 2014. It discovers strategies to reduce the number of parameters in con-
volution layers, and replaces the fully-connected layers with the Average Pooling
layer.
3 Architecture
This section presents the basic architecture of our hardware design, which con-
sists of two kernels: conv kernel and fc kernel (Fig. 1). Each kernel contains an
input buffer to cache data for further re-use, a computation kernel to perform
convolution (conv) or matrix vector multiplication (fc), and an output buffer
to store partial result before the final result is ready. Here we introduce these
three components for each kernel in detail.
Fig. 1. A general architecture for the convolution layer (kernel size 3 × 3) with three
different level of parallelism (PP , PV , and PF ). The top-left part is the line buffer.
The computation kernel inside conv contains several convolution kernels run-
ning in parallel, which consists of multiple multipliers followed by an adder tree.
Suppose the width of a coefficient kernel is k; then the number of multipliers
is k^2, and the depth of the adder tree is log(k). Multipliers take input from a
customised input buffer called a line buffer [1], which enables k data reads in one
clock cycle from the input feature map. The other side of the line buffer connects
to a larger input buffer that partly or fully contains the input feature map. Mul-
tipliers also connect to another input buffer that caches coefficients. The output
buffer in the conv kernel stores the partial convolution result. In each cycle the
result from the adder tree will be used to update the partial result. Data type
in the conv kernel is single-precision floating-point.
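The behaviour of the line buffer can be modelled in software as keeping the k most
recent input rows and emitting a k × k window per position, as in the following
illustrative Python sketch (the hardware uses shift registers and BRAM instead).

    # Software model of a k-row line buffer producing k x k windows per "cycle".
    from collections import deque

    def windows(rows, k):
        """rows: iterable of equal-length lists (one image row per cycle)."""
        lines = deque(maxlen=k)             # the k most recent rows
        for row in rows:
            lines.append(row)
            if len(lines) == k:             # enough rows buffered to form windows
                for x in range(len(row) - k + 1):
                    yield [line[x:x + k] for line in lines]

    image = [[c + 10 * r for c in range(6)] for r in range(5)]
    first = next(windows(image, k=3))       # the 3x3 window at the top-left corner
    assert first == [[0, 1, 2], [10, 11, 12], [20, 21, 22]]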
The major functionality of the fc kernel is to perform dot product between
the reshaped input feature vector and the coefficient matrix. The computa-
tion kernel contains several multipliers in parallel to calculate the dot product
between each row of the coefficient matrix and the feature vector. There are two
ways to organise the buffers: cache the whole feature vector and store no partial
output, or store the partial result and use no input buffer. These two methods
are related to the computation sequence we choose for the fc (row major or col-
umn major), which will be discussed in Sect. 4. Because there is a feedback loop
within the dot product, we use a fixed-point data type to enhance performance.
The bit width of the fixed-point data type used is 32, which contains 23 fraction
bits and 8 integer bits.
Data Access Pattern. Data access pattern is critical to conv kernel implemen-
tation, because we could choose to compute the convolution either by channels
in the feature map, or by filters in the output. Each of these patterns has a
trade-off between the input and output buffer size.
Consider the two nested loops in Eq. 1: one iterates over the channels and the
other over the filters (Algorithm 1). Thus we have two access patterns: filter major
and channel major. The main difference between these two patterns lies in
memory usage. In the following we calculate the input and output buffer sizes. (1)
Filter major: Algorithm 1 presents the filter major pattern. Once we complete
the inner reduce-add loop over channels for each output filter f in the filter major
pattern, the final result for this filter is ready. Thus, we only need to store
B_H B_W / s^2 elements, which is the size of one output filter, in the output buffer. However,
it needs to iterate through all the channels of the input feature map and the
associated coefficient kernel, so the input buffer size of the filter major pattern
is (B_H B_W + k^2) N_C + k B_W, where k B_W is the line buffer size. (2) Channel
major: In this case, the channel iteration is the outer loop. After each iteration
in the outer loop, only partial results for all N_F filters are available and they will
be updated in the following iterations. Thus the output buffer is required to have
size N_F × B_H B_W / s^2. For the input buffer, only one channel of the input feature
map needs to be cached, but all the coefficients for this channel should also be
stored in the input buffer. Hence the input buffer size is B_H B_W + k^2 N_F + k B_W.
The line buffer is also required for this case.
Table 2 summarises the buffer usage for these two data access patterns. With
these parameterised analyses, it is convenient to decide which data access pattern
should be used based on the parameter values. In general, although these two
patterns have similar buffer usage, it is better to choose channel major as it has
simpler control logic.
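The buffer-size expressions above translate directly into a small sizing helper. The sketch below (Python, with hypothetical parameter names) computes both totals so the cheaper pattern can be picked per layer.

def buffer_sizes(BH, BW, k, s, NC, NF):
    """Return per-pattern buffer usage, in elements, for one conv layer."""
    line_buffer = k * BW
    filter_major = {
        "input":  (BH * BW + k * k) * NC + line_buffer,
        "output": (BH * BW) // (s * s),
    }
    channel_major = {
        "input":  BH * BW + k * k * NF + line_buffer,
        "output": NF * (BH * BW) // (s * s),
    }
    return filter_major, channel_major

if __name__ == "__main__":
    fm, cm = buffer_sizes(BH=224, BW=224, k=3, s=1, NC=64, NF=128)
    print("filter major :", fm, "total", sum(fm.values()))
    print("channel major:", cm, "total", sum(cm.values()))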
5 Optimisation Flow
This section presents our optimisation flow for CNN-based object detection algo-
rithms. The optimisation flow has three major steps: strategy selection, parame-
ter tuning, and algorithm-specific optimisation.
Strategy Selection. Once we have the CNN network architecture configura-
tion, we are able to select which strategy to use for each layer. There are two
aforementioned strategies, one is the data access pattern for the conv kernel,
and the other one is the computation sequence for the fc kernel. The selection
will be based on this algorithm: For each layer i,
1. If layer i is a conv layer, then compare the buffer usage of all data access
patterns and find the one that uses the minimal total buffer size.
2. If layer i is a fc layer, then compare Mi and Ni to decide whether to use the
row major or the column major strategy.
After selecting strategies for each layer, we can derive exact expressions of
the maximum BRAM usage and the maximum level of parallelisation, which are
decided by both Table 2 and fc’s Mi and Ni .
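A compact sketch of this per-layer selection is given below. The conv buffer totals are recomputed inline from the expressions derived earlier, and the direction of the fc comparison (row major when M_i ≤ N_i) is our assumption, since the text only states that M_i and N_i are compared.

def total_buffer(BH, BW, k, s, NC, NF, pattern):
    """Total (input + output) buffer size for one conv layer under a pattern."""
    line_buffer = k * BW
    if pattern == "filter_major":
        return (BH * BW + k * k) * NC + line_buffer + (BH * BW) // (s * s)
    return BH * BW + k * k * NF + line_buffer + NF * (BH * BW) // (s * s)

def select_strategies(layers):
    """layers: list of dicts, e.g. {"type": "conv", "BH": ..., "BW": ..., "k": ...,
    "s": ..., "NC": ..., "NF": ...} or {"type": "fc", "M": rows, "N": cols}."""
    strategies = []
    for layer in layers:
        if layer["type"] == "conv":
            args = {p: layer[p] for p in ("BH", "BW", "k", "s", "NC", "NF")}
            fm = total_buffer(pattern="filter_major", **args)
            cm = total_buffer(pattern="channel_major", **args)
            strategies.append("filter_major" if fm < cm else "channel_major")
        else:
            # fc layer: compare M_i and N_i; row major when M <= N is an
            # illustrative assumption, not a rule stated in the text.
            strategies.append("row_major" if layer["M"] <= layer["N"]
                              else "column_major")
    return strategies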
Parameter Tuning. Supposing we use the channel major data access pat-
tern and the row major strategy, which are suitable for most cases, we further
tune several parameters to optimise the amount of parallelism.
1. Pipeline depth (PP ): For conv or fc, PP represents the number of kernels
to support in hardware. The supported layers can be connected as a pipeline,
with the output of a layer to be the input for the next layer.
2. Filter width (PF ): For conv only, PF represents the number of filters
processed in parallel, which has an upper bound NF .
3. Vector width (PV ): For conv or fc, PV represents the amount of input
data processed in parallel. While computing convolution between one kernel
and one channel’s feature map, it is possible to compute multiple kernels in
parallel. This level of parallelisation can be measured by the width of input
vector in each cycle.
Convolution Layer. Based on the above parallelism parameters, we need to
modify the line buffer size, which should be P_V B_W to support P_V read operations
in parallel. Besides, we derive expressions for the on-chip (BW_i^conv) and DDR
bandwidth, estimated as shown below for the conv and fc kernels:
          conv                                               fc
  Logic   L^c(P_P × P_F × P_V)                               L^f(P_P × P_V)
  Memory  max(BS_i^conv × DW / BRAM_size,                    max(N_i × DW / BRAM_size,
              BW_i^conv × DW / BRAM_bw)                          2 P_V × DW / BRAM_bw)
  DDR     P_V P_F (1/F_i + 1/C_i)                            P_V (1/M_i + 1/N_i) + 1
6 Evaluation
This section describes our evaluation and performance analysis of the hardware
design with specific resource constraints and network architecture. We choose to
measure the performance for the YOLO algorithm with the GoogLeNet back-
bone.
Implementation Details. We briefly introduce the implementation details of
our hardware design. We present the overall architecture in Sect. 3. The pro-
posed architecture and optimisation flow can target various FPGA platforms.
To illustrate our approach, our hardware design is built for the Xilinx Zynq
platform (zc706), which contains two main components: PS and PL. PS is the
processing system with an ARM CPU and a DDR memory, while PL refers to
the FPGA, which contains logic resources, on-chip memory, and DMA support.
In our case, CNN hardware design targets the PL part, with some complex soft-
ware algorithms running on the PS part. We use AXI to connect the
PS and the PL.
The CNN hardware design can be split into conv kernel and fc kernel. They
are parameterised and are connected to each other through FIFO. They use our
streaming protocol to control and schedule tasks. Coefficients and other external
data are loaded from the external DDR memory.
Design Model Accuracy. We estimate the design model accuracy from the
synthesis report and the estimated resource usage for 3 different cases: PV =
1, 4, 8 (Fig. 3). Here the kernel size of the conv module is 7 × 7, and the column
number of the fc module is 4096. The estimation is based on the equations in Table 3.
The design model accuracy is above 85%, and therefore it can support our
optimisation flow. The dotted line represents the available resources in our target
chip. Thus, we select PV = 4 in this design.
Algorithm Evaluation. Based on the optimisation model, we derive the
optimal design parameters for both YOLO (GoogLeNet) and Faster RCNN
Fig. 3. Design model accuracy measured with the synthesis report and the model esti-
mation results. Resource usage is normalised against available resources in the target
chip. The last digit of each label is the PV value.
(VGG16), and predict the best performance for these two algorithms. In addi-
tion, we also evaluate the software performance on x86 CPU and ARM CPU.
We use Darknet [8] and Caffe [6] as the software reference for YOLO and Faster
RCNN evaluation. Results are listed in Table 4.
Based on the optimization model, we make a few decisions. (1) Input and
output buffers are necessary so that the design has the appropriate bandwidth.
(2) For the 1 × 1 kernel, the 25 BRAM requirement is not the major limitation
in resource usage. (3) At current precision, the DSPs are the limiting resources
for conv kernels. We can set PV = 4 and PP = PF = 1 in this case. (4) The fc kernel
also uses PV = 4 to coordinate with the conv kernel's output.
We estimate that the overall execution time for YOLO (GoogLeNet) is
0.744 s, and for Faster RCNN (VGG16) is 0.875 s. Compared with the best
software performance on ARM (36.92 s), the speed-up is 49.6 times. Even com-
pared with the x86 CPU there is a 1.5 times speed-up. Although the GPU version
is much faster than our implementation, the GPU (Titan X) is not suitable for
embedded systems. Also, the total energy cost of the FPGA version (0.868 J) is
much smaller than that of the GPU version (23 J).
7 Summary
This paper presents our novel approach to optimise CNN-based object detec-
tion algorithms on embedded FPGA platforms, which consists of a design model
for the basic CNN hardware architecture, and an optimisation flow which takes
into account both FPGA optimisation strategies and algorithm-specific optimi-
sation strategies. Our evaluation shows that an optimised hardware design for
the YOLO algorithm with GoogLeNet backbone can reach 49.6 times speed-up
compared with software on ARM. Also our design model accuracy is above 85%.
Future work includes evaluating the object detection application with multiple
real world datasets, introducing automatic data quantisation, and enhancing the
optimisation flow to support CNN training.
References
1. Bosi, B., et al.: Reconfigurable pipelined 2-D convolvers for fast digital signal
processing. IEEE Trans. VLSI Syst. 7(3), 299–308 (1999)
2. Chakradhar, S., et al.: A dynamically configurable coprocessor for convolutional
neural networks. In: ISCA (2010)
3. Dai, J., et al.: R-FCN: object detection via region-based fully convolutional net-
works. arXiv preprint (2016). arXiv:1605.06409
4. Farabet, C., et al.: NeuFlow: a runtime-reconfigurable dataflow processor for vision.
In: ECVW (2011)
5. Girshick, R.: Fast R-CNN. In: ICCV (2015)
6. Jia, Y., et al.: Caffe: convolutional architecture for fast feature embedding. arXiv
preprint (2014). arXiv:1408.5093
7. Qiu, J., et al.: Going deeper with embedded FPGA platform for convolutional
neural network. In: FPGA (2016)
8. Redmon, J.: Darknet: open source neural networks in C (2013–2016). http://
pjreddie.com/darknet/
9. Redmon, J., et al.: You only look once: unified, real-time object detection (2015).
https://arxiv.org/abs/1506.02640
10. Ren, S., et al.: Faster R-CNN: towards real-time object detection with region pro-
posal networks. In: NIPS (2015)
11. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. ImageNet Challenge (2014)
12. Suda, N., et al.: Scalable and modularized RTL compilation of convolutional neural
networks onto FPGA. In: FPL (2016)
13. Suda, N., et al.: Throughput-optimized OpenCL-based FPGA accelerator for large-
scale convolutional neural networks. In: FPGA (2016)
14. Szegedy, C., et al.: Going deeper with convolutions. In: CVPR (2015)
15. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks.
In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol.
8689, pp. 818–833. Springer, Cham (2014). doi:10.1007/978-3-319-10590-1_53
16. Zhang, C., et al.: Optimizing FPGA-based accelerator design for deep convolutional
neural networks. In: FPGA (2015)
An FPGA Realization of a Deep Convolutional
Neural Network Using a Threshold
Neuron Pruning
1 Introduction
requirement of the embedded vision system, since the existing system using a
CPU is too slow, acceleration of the CNN is necessary [17]. Most software-
based CNNs use GPUs [2,3,7,23,24]. Unfortunately, since GPUs con-
sume much power, they are unsuitable for embedded systems [9]. Thus,
FPGA-based CNNs are required for low-power, real-time embedded
vision systems. As for the classification accuracy, a CNN using a fixed-point
representation has almost the same accuracy as one using a floating-point rep-
resentation [11]. The FPGA can use the minimum necessary precision, which reduces the
hardware resources and increases the clock frequency, whereas the GPU cannot.
A previous work [9] reported that, in terms of performance per power, the
FPGA-based CNN is about 10 times more efficient than the GPU-based one.
[Figure: a single neuron with inputs X0 = 1, X1, ..., Xn, weights W0 (bias), W1, ..., Wn, summed output Y, and activation output Z = fact(Y).]
The rest of the paper is organized as follows: Sect. 2 introduces the convolutional
deep neural network (CNN); Sect. 3 introduces the neuron pruning in the fully
connected (FC) layer on the CNN; Sect. 4 shows the serial-input parallel-output
FC circuit; Sect. 5 shows the experimental results; and Sect. 6 concludes the paper.
where X0 is a constant one and W0 denotes a bias which corrects the deviation of
the given data. Typically, the activation function is realized by a sigmoid, a tanh,
a ReLU [18], and so on. In this paper, we use the ReLU function, which is suitable
for a hardware realization. A convolutional deep neural network (CNN)
has multiple layers. Figure 4 shows an example of the CNN. The typical layer
consists of a 2D convolutional layer, a pooling layer, and a classification
layer. Each layer consists of multiple feature maps. To recognize the input
image, first, the feature map reacts corresponding subdivided training data by
2D convolutional layers with pooling layers. Then, the classifier selects the appro-
priate reactions from feature maps. Usually, the classifier is realized by the fully
connected neural network. In this paper, for layer i, Ki denotes the kernel size,
Ni denotes the number of feature maps, and Li denotes the feature map size.
Figure 5 shows the 2D convolution operation. It computes the output by shifting
a K × K size kernel over the N_i input feature maps to produce the N_{i+1} output
feature maps. For (x, y) at output feature map i + 1, the following MAC
(multiply-accumulation) operation is performed:

Y_{i+1,x,y} = \sum_{k=0}^{N_i - 1} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} X_{k,x+m,y+n} W_{k,m,n}    (1)

Z_{i+1,x,y} = f_{act}(Y_{i+1,x,y}).
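As a quick software check of Eq. (1), the sketch below evaluates one output position with plain nested loops; the array shapes and the ReLU activation are assumptions chosen to match the notation above.

import numpy as np

def conv_mac(X, W, x, y):
    """Eq. (1): X has shape (Ni, H, W_in) and W has shape (Ni, K, K).
    Returns Y_{i+1,x,y}, the pre-activation value at output position (x, y)."""
    Ni, K, _ = W.shape
    Y = 0.0
    for k in range(Ni):          # over input feature maps
        for m in range(K):       # kernel rows
            for n in range(K):   # kernel columns
                Y += X[k, x + m, y + n] * W[k, m, n]
    return Y

def relu(v):
    # The paper uses ReLU as the activation f_act.
    return max(0.0, v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((3, 8, 8))
    W = rng.standard_normal((3, 3, 3))
    Y = conv_mac(X, W, 2, 2)
    print("Y =", Y, " Z =", relu(Y))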
Fig. 10. Circuit for SIPO fully connected layers with the threshold neuron pruning.
access. On the other hand, since the neuron pruning eliminates all the incom-
ing and outgoing edges, it maintains the sequential memory access of weights.
Thus, it is suitable for a hardware realization.
First, we define the neuron pruning.
Definition 3.1. A neuron pruning eliminates all the incoming and outgoing
edges for a neuron.
There are various ways to decide the thresholds for the neuron pruning. In this paper,
the threshold neuron pruning is performed when one of the following conditions
is satisfied:
1. \sum_{k=1}^{n} |w_{in,k}| < \mu_i \times n
2. \sum_{k=1}^{m} |w_{out,k}| < \mu_o \times m,
where win,k denotes the k-th weight for the incoming edge, wout,k denotes the
k-th weight for the outgoing one, µi denotes the threshold for the incoming
edge, and µo denotes that for the outgoing edge (Fig. 6). In this paper, different
thresholds are used for incoming edges and outgoing ones.
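A minimal software sketch of this threshold test is shown below. It represents a fully connected layer by its incoming and outgoing weight matrices and removes whole neurons; the matrix orientation and helper name are our own assumptions.

import numpy as np

def threshold_neuron_pruning(W_in, W_out, mu_i, mu_o):
    """Prune neurons of one hidden layer.

    W_in:  (n_prev, n) weights on the incoming edges of the n neurons
    W_out: (n, n_next) weights on the outgoing edges of the n neurons
    A neuron is pruned when either threshold condition from the text holds.
    Returns the reduced matrices and the indices of surviving neurons.
    """
    n_prev, n = W_in.shape
    n_next = W_out.shape[1]
    keep = []
    for j in range(n):
        incoming_ok = np.sum(np.abs(W_in[:, j])) >= mu_i * n_prev
        outgoing_ok = np.sum(np.abs(W_out[j, :])) >= mu_o * n_next
        if incoming_ok and outgoing_ok:   # prune if either sum is below threshold
            keep.append(j)
    keep = np.array(keep, dtype=int)
    return W_in[:, keep], W_out[keep, :], keep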
weights is eliminated by the neuron pruning, and only a small part of the weights is
packed in the weight memories. Since the FPGA can realize the appropriate size
of memory with block RAMs (BRAM) and distributed memories, it
is suitable for realizing the neuron pruning. All the weights for each layer are read,
and the output neurons are updated at a time. After all the inputs are evaluated,
the circuit transfers the values of the output neurons to the shift register. Then, the next
layer is evaluated by shifting the values in the shift register. When all the layers
are evaluated, the values of the output neurons are sent to the external output.
5 Experimental Results
We designed the CNN using Chainer, which is a deep neural network frame-
work [3], and the target task is CIFAR-10 [5], an image recognition
task. In the experiment, we set an appropriate threshold µ manually, and
applied the threshold neuron pruning to each fully connected layer.
Table 2 compares the number of neurons for each fully connected layer. Note
that, generally, when the number of neurons decreases, the recognition accu-
racy also decreases. In the comparison, we measured the number of neurons for
the original CNN, the 99% accuracy, and the 95% accuracy compared with the
accuracy of the original one. From Table 2, as for the 99% accuracy, the number
of neurons decreased by 89.3%, while as for the 95% accuracy, it decreased by
91.8%. Table 3 compares the number of 18 Kb BRAMs for each fully connected
layer. From Table 3, as for the 99% accuracy, the number of BRAMs decreased
by 99.0%, while as for the 95% accuracy, it decreased by 99.8%. Let n_i be the
number of incoming edges for each layer, n_o be that of outgoing edges, and w be
the bit precision (in the experiment, we used 8 bits). Since the amount of weight
memory for each layer is n_i n_o w, which is O(n^2), the neuron pruning reduces the
amount of memory quadratically. In our experiment, for the VGG-11 CNN, we
could realize the weight memory for the fully connected layers with the on-chip mem-
ory on the FPGA. In that case, since it reads weights with a wide-bandwidth
memory access, it can operate the fully connected layers at high speed. Also,
since it requires no extra off-chip memory, it reduces the power consumption and
costs.
We applied the threshold neuron pruning with the 99% accuracy. Then, we
implemented the fully connected layers on the Digilent Inc. NetFPGA-1G-CML
evaluation board (It has a Xilinx Inc. Kintex 7 XC7K325T FPGA: 50,950 slices,
890 18 Kb BRAMs, and 840 DSP slices). We used the Xilinx Inc. Vivado 2016.2
with a timing constraint of 100 MHz. Our implementation used 4,241 slices, 151 18 Kb
BRAMs, and 145 DSP slices. Also, it satisfied the timing constraint for real-
time applications. The delay time for the fully connected layers was 29.0 µs.
We measured the power consumption, excluding that of the power sources on
the board: it was 7 W. Since the implemented fully connected layers operated
with a 29.0 µs delay time, the performance was 34482.7 images/s. Thus, the
performance per power efficiency is 4926.10 (images/s)/W.
6 Conclusion
In this paper, we proposed the threshold neuron pruning, which eliminates most
of the weight memory, which was a bottleneck of the conventional realiza-
tion. By applying the threshold neuron pruning, we could realize the weight
memory with on-chip memory on the FPGA. Thus, it operated with high-speed
memory access. In this paper, we showed the SIPO fully connected layer cir-
cuit, which efficiently accesses the on-chip memories of the FPGA. In the com-
parison, we measured the number of neurons for the original CNN: as for the
99% accuracy, the number of neurons decreased by 76.4%, while as for the 95%
accuracy, it decreased by 91.7%. In turn, as for the 99% accuracy, the number
of BRAMs decreased by 96.2%, while as for the 95% accuracy, it decreased by
99.7%. We implemented the neuron-pruned fully connected layers on the Digilent
Inc. NetFPGA-1G-CML FPGA board, and compared them with the ARM Cortex-A15
processor and the Kepler GPU. As for the delay time, the FPGA was 219.0 times
faster than the CPU and 12.5 times faster than the GPU. Also, the performance
per power efficiency was 125.28 times better than the CPU and 17.88 times better
than the GPU.
The future project is to apply the pruning technique to the binarized
CNN [19].
Acknowledgments. This research is supported in part by the Grants-in-Aid for Sci-
entific Research of JSPS, and an Accelerated Innovation Research Initiative Turning
Top Science and Ideas into High-Impact Values program (ACCEL) of JST.
References
1. Anwar, S., Hwang, K., Sung, W.: Structured pruning of deep convolutional neural
networks. Computer Research Repository (CoRR), December 2015. https://arxiv.
org/ftp/arxiv/papers/1512/1512.08571.pdf
2. Caffe: Deep learning framework. http://caffe.berkeleyvision.org/
3. Chainer: a powerful, flexible, and intuitive framework of neural networks. http://
chainer.org/
4. Chakradhar, S., Sankaradas, M., Jakkula, V., Cadambi, S.: A dynamically con-
figurable coprocessor for convolutional neural networks. In: Annual International
Symposium on Computer Architecture (ISCA), pp. 247–257 (2010)
5. The CIFAR-10 data set. http://www.cs.toronto.edu/kriz/cifar.html
6. Ciresan, D.C., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for
image classification. In: Proceedings of CVPR (2012)
7. CUDA-Convnet2: Fast convolutional neural network in C++/CUDA. https://
code.google.com/p/cuda-convnet2/
8. Donahue, J., Hendricks, L.A., Guadarrama, S., Rohrbach, M., Venugopalan, S.,
Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual
recognition and description. In: Proceedings of CVPR (2015)
9. Dundar, A., Jin, J., Gokhale, V., Martini, B., Culurciello, E.: Memory access opti-
mized routing scheme for deep networks on a mobile coprocessor. In: HPEC 2014,
pp. 1–6 (2014)
10. Farabet, C., Poulet, C., Han, J.Y., LeCun, Y.: CNP: an FPGA-based processor for
convolutional networks. In: FPL 2009, pp. 32–37 (2009)
11. Farabet, C., Martini, B., Akselrod, P., Talay, S., LeCun, Y., Culurciello, E.: Hard-
ware accelerated convolutional neural networks for synthetic vision systems. In:
ISCAS 2010, pp. 257–260 (2010)
12. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accu-
rate object detection and semantic segmentation. In: Proceedings of CVPR (2014)
13. Goodfellow, I.J., Bulatov, Y., Ibarz, J., Arnoud, S., Shet, V.: Multi-digit number
recognition from street view imagery using deep convolutional neural networks
(2013). arXiv preprint: arXiv:1312.6082
14. Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural net-
works with pruning, trained quantization and Huffman coding. In: ICLR 2016
(2016)
15. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human
action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
16. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Li, F.: Large-
scale video classification with convolutional neural networks. In: Proceedings of
CVPR, pp. 1725–1732 (2014)
17. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
18. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann
machines. In: ICML, pp. 807–814 (2010)
19. Nakahara, H., Yonekawa, H., Sasao, T., Iwamoto, H., Motomura, M.: A memory-
based realization of a binarized deep convolutional neural network. In: The Inter-
national Conference on Field-Programmable Technology (FPT 2016), pp. 273–276
(2016)
20. Peemen, M., Setio, A.A.A., Mesman, B., Corporaal, H.: Memory-centric accelerator
design for convolutional neural networks. In: ICCD 2013, pp. 13–19 (2013)
21. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. In: ICLR 2015 (2015)
22. Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: DeepFace: closing the gap to human-
level performance in face verification. In: Proceedings of CVPR, pp. 1701–1708
(2014)
23. Theano. http://deeplearning.net/software/theano/
24. Torch: A scientific computing framework for LuaJIT. http://torch.ch/
25. Toshev, A., Szegedy, C.: DeepPose: human pose estimation via deep neural net-
works. In: Proceedings of CVPR (2014)
26. Qiu, J., Wang, J., Yao, S., Guo, K., Li, B., Zhou, E., Yu, J., Tang, T., Xu, N.,
Song, S., Wang, Y., Yang, H.: Going deeper with embedded FPGA platform for
convolutional neural network. In: FPGA 2016, pp. 26–35 (2016)
27. Zhang, C., Li, P., Sun, G., Guan, Y., Xiao, B., Cong, J.: Optimizing FPGA-based
accelerator design for deep convolutional neural networks. In: FPGA 2015, pp.
161–170 (2015)
Accuracy Evaluation of Long Short Term
Memory Network Based Language Model
with Fixed-Point Arithmetic
1 Introduction
Language models, capturing the likelihood of words and phrases in text, are
widely used in natural language processing. It has been shown by prior research
that neural network based language models (NNLMs) [3,8] tend to outperform
many other advanced techniques because neural networks, such as Long Short
Term Memory (LSTM) Network [5], have the expressive ability to “remember”
the sequential information and patterns of sentences. However, the training and
prediction procedures require significantly more storage and computation cost,
which has limited the proliferation of their applications, especially in the field of
embedded systems [9].
2 LSTM in a Nutshell
The core idea of the vanilla LSTM [5] can be expressed by its gate and cell update equations.
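One widely used formulation of these updates (the weight matrices W, U and biases b below are generic names, not necessarily the paper's notation) is:

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)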
3 Experimental Methodology
In our experiments, we modified the floating-point versions of the vanilla LSTM
network into fixed-point versions to explore the effect of this modification and
the most suitable fixed-point solution for LSTMs. The fixed-point version took
bit-widths as parameters, including the bit-widths of neural units, weights, and activa-
tion functions, so it could run in any bit-width configuration. Only the testing was
translated into a fixed-point version while training and verification were still
computed with the standard floating-point arithmetic. All fixed-point experi-
ments were conducted in Matlab2016a with Fixed-Point Designer toolbox. The
software version of the vanilla LSTM based language model and the correspond-
ing data set in our experiment were borrowed from Zaremba et al.’s work [10].
As the fixed-point simulation in Matlab was approximately 30 times slower than
floating-point computation, we chose the small scale and medium scale models in
[10] for illustration purposes. The floating-point training process was completed
in Python under the TensorFlow framework [1].
The modification of the testing process consisted of three main steps. Firstly, the
pre-trained weight and parameters of the model were converted to fixed-point
numbers. Secondly, when a word was fed to the model, all arithmetic opera-
tions such as matrix multiplication and element-wise operation were modified to
operate on fixed-point numbers. Thirdly, for hardware implementation, special
approximations should be applied to the non-linear activation functions, which will
be thoroughly discussed in Sect. 5. During testing, we used perplexity (PPL),
which is a common metric for language prediction accuracy, to capture the qual-
ity of a sentence or a paragraph; the lower the PPL value, the better the
language model. The absolute error of our fixed-point modification was quan-
tified by the absolute value of the difference between the PPL output from the
original floating-point version and that from the fixed-point version. The error
rate was further calculated by dividing the absolute error by the PPL output
from the corresponding floating-point version.
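To illustrate what the fixed-point conversion in the first two steps amounts to, the sketch below quantises values to a signed fixed-point format with a chosen integer and fractional bit-width and computes the PPL error rate as defined above; it is a simplified stand-in for the Matlab Fixed-Point Designer flow, and the example PPL value in it is hypothetical.

import numpy as np

def to_fixed_point(x, int_bits, frac_bits):
    """Quantise x to a signed fixed-point number with int_bits integer bits and
    frac_bits fractional bits, saturating on overflow."""
    scale = 2.0 ** frac_bits
    max_val = 2.0 ** int_bits - 1.0 / scale
    min_val = -2.0 ** int_bits
    q = np.round(np.asarray(x, dtype=np.float64) * scale) / scale
    return np.clip(q, min_val, max_val)

def error_rate(reference_ppl, fixed_point_ppl):
    """PPL error rate as defined in the text: |difference| / floating-point PPL."""
    return abs(fixed_point_ppl - reference_ppl) / reference_ppl

if __name__ == "__main__":
    w = np.array([0.013, -1.75, 30.2, -0.0004])
    print(to_fixed_point(w, int_bits=6, frac_bits=16))
    print(error_rate(111.4334, 113.6))  # 113.6 is a hypothetical fixed-point PPL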
All experiments were completed on a PC equipped with a 3.2 GHz AMD
CPU and 8 GB memory. The first 1000 words of the original testing set were
selected as the testing sample and fed into the model in sequence. Based on the
PPL errors generated under different fixed-point configurations, we explored bit-
width configurations from 8 bits to 32 bits in detail and found noticeable turning
points.
Overflow of the integer part will significantly affect the language model’s perfor-
mance because the integer part primarily determines the representation scope.
Thus, we first need to figure out the most suitable integer length before investi-
gating the precision error resulting from the shrinkage of the fractional part.
Before the modification, we first tested the floating-point model on the
selected testing sample; the PPL values were approximately 111.4334 for the
small model and 79.5194 for the medium model. Then, for all values in the model,
fixing the length of the fractional part as 16 bits and shortening the integer part
from 16 bits to 4 bits, we ran the fixed-point model on the same 1000 testing
words in sequence and compared the PPL output with the baseline PPL derived
from the floating-point version. The linear approximation of non-linear activation
functions had not yet been applied at this point.
As is shown in Fig. 1, it is noticeable that the 4-bit integer was not wide
enough for both the small model and the medium model in terms of scope rep-
resentation. So the errors were extremely large and unstable. When the integer
width increased by one or two bits, the representation scope was mostly satisfied.
Thus, the length of fixed-point numbers’ integer part in both models should
never be less than 5 bits. Otherwise, the limit of scope representation would lead
to disastrously huge error.
Fig. 1. The PPL of both models under different fixed-point configurations where the
length of the fractional part is fixed as 16 bits and the length of the integer part is
shortened from 16 bits to 4 bits.
Then, with a similar method, we fixed the integer length at 6 bits and shortened
the fractional part from 16 bits to 4 bits. Though the 5-bit integer part was
wide enough for scope representation according to the previous experiments, we still
decided to use 6-bit integers to cover a larger scope because there might be
unexpected outliers. As shown in Fig. 2, in order to guarantee the precision
during calculation, the length of the fractional part should be no less than 6 bits,
and the corresponding error rate was approximately 1.95% for the small scale
model and 6.43% for the medium scale model.
In order to maintain the precision of the network and save the storage space
at the same time, based on our experiments, we believed that the 12-bit long
number with 6-bit fractional part was theoretically the best trade-off for both
models. In addition, both of the models showed similar trends and had exactly the
same turning point when shortening the fixed-point numbers. This indicates that
the scale of the LSTM network had little influence on the choice of fixed-point
configuration. Thus, it can be inferred that our methodology and experimental
results are compatible with large scale models as well.
Fig. 2. The PPL of both models under different fixed-point configurations where the
length of the integer part is fixed as 6 bits and the length of the fractional part is
shortened from 16 bits to 4 bits.
Based on Table 1, with the configuration of a 6-bit integer part and different lengths
of fractional parts, we built fixed-point versions of both PLAs (piecewise linear
approximations of the activation functions), and Fig. 3 shows the maximum and mean
absolute errors of both PLAs. It turned out that, for both PLAs, the errors were stable
when the fractional part was longer than 9 bits.
Fig. 3. Maximum (left) and mean (right) absolute errors of PLA1 and PLA2, config-
ured with 6 bits integer and varying lengths of the fractional part.
We also compared the performance between PLA1 and PLA2 when they were
applied to the language model. As is shown in Fig. 4, the length of the integer
part was fixed at 6 bits and the fractional part varied from 9 bits to 11 bits. The
absolute error of the PPL value was used to evaluate the performance of each
solution. The rest of the numbers in the model, such as neuron values and connection
matrices, were represented with 22-bit (6-bit integer part and 16-bit fractional
part) fixed-point numbers. Though the difference was minor, it was obvious that
PLA2 outperformed PLA1 at every fixed-point configuration. However, if the
hardware resources are limited, PLA1 is recommended as it consumes fewer resources.
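For readers unfamiliar with PLAs, the sketch below shows one possible piecewise linear approximation of tanh built from a handful of breakpoints; it is not PLA1 or PLA2 from the paper, whose exact segment definitions are not reproduced here.

import numpy as np

def pla_tanh(x, breakpoints=None):
    """Piecewise linear approximation of tanh.

    Segment endpoints are sampled from the true tanh at a few breakpoints and
    connected by straight lines; outside the last breakpoint the output is held
    at tanh(+/-3). Breakpoint placement here is an illustrative choice.
    """
    if breakpoints is None:
        breakpoints = np.array([-3.0, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0])
    values = np.tanh(breakpoints)
    # np.interp performs the piecewise linear interpolation and clamps at the ends.
    return np.interp(x, breakpoints, values)

if __name__ == "__main__":
    xs = np.linspace(-4, 4, 9)
    err = np.abs(pla_tanh(xs) - np.tanh(xs))
    print("max abs error:", err.max())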
3.5
PPL error compared with floating−point results
2.5
1.5
0.5
0
PLA1_6−9 PLA2_6−9 PLA1_6−10 PLA2_6−10 PLA1_6−11 PLA2_6−11
PLA type and bit−width configuration (Integer−decimal)
Fig. 4. Performance of the LSTM language model with different fixed-point configu-
rations using PLA1 (blue) and PLA2 (red). (Color figure online)
6 Mixed Bit-Widths
Modern FPGAs supply built-in primitives to support basic operations, such as
accumulation and multiplication, which are widely used in the LSTM network.
Moreover, these primitives have their own favorite bit-width that benefits the
hardware implementation. One DSP48E slice, for instance, contains one 25 × 18
two’s complement multiplier, an adder, and an accumulator. Although differ-
ent devices have their own favourite bit-widths, 8-bit numbers or 16-bit numbers
are more suitable for the whole system for two main reasons. Firstly, a
length that is an integer multiple of the machine word-length usually leads to effi-
cient memory management and communication. Secondly, most ASICs employ
machine word-length numbers. With 8-bit or 16-bit numbers, it is easier
for the FPGA-based system to cooperate with other ASIC-based systems. Even-
tually, we decided to use 8-bit fixed-point numbers to represent all parameter
matrices of the model and 16-bit numbers for the rest.
With methods similar to those introduced in Sect. 4, we also conducted a variety
of experiments to search for the best bit-width choice. The only difference was
that we used linear approximations of both the sigmoid function and the tanh
function in this section. The PPL error rate was used to quantify the performance
of the model, and the results of the experiments are illustrated in Fig. 5.
Fig. 5. The PPL error rate of the model under different mixed bit-width configurations.
Based on these experiments, 8-bit numbers with a 3-bit fractional part along
with 16-bit numbers with a 9-bit fractional part formed the best configuration for
both models, and the corresponding PPL error rates for the small scale model
and the medium scale model were around 5.54% and 2.14% respectively. This
result confirms again that the fixed-point configuration is insensitive to the scale
of the models.
7 Conclusion
Our work gives a comprehensive evaluation for implementing an LSTM network
based language model on FPGAs by studying a wide range of bit-widths to achieve
the best performance and area efficiency. Theoretically, for both the small scale
model and the medium scale model, the 12-bit fixed-point configuration is the
best choice balancing accuracy and storage savings, which indicates that the
scale of the model has little influence on the choice of fixed-point configurations.
Both PLAs of the tanh function are acceptable for the model; PLA1 is more
suitable if the hardware resources are limited, while PLA2 is better if the model
needs to be more precise. Eventually, based on these results, in order to obtain
efficient memory management and communication, a mixed bit-width solution
combining 8-bit numbers and 16-bit numbers is proposed and evaluated.
References
1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S.,
Davis, A., Dean, J., Devin, M., et al.: Tensorflow: large-scale machine learning on
heterogeneous distributed systems (2016). arXiv preprint arXiv:1603.04467
2. Amin, H., Curtis, K.M., Hayes-Gill, B.R.: Piecewise linear approximation
applied to nonlinear function of a neural network. IEE Proc. Circuits Devices Syst.
144(6), 313–317 (1997)
3. Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language
model. J. Mach. Learn. Res. 3(Feb), 1137–1155 (2003)
4. Chang, A.X.M., Martini, B., Culurciello, E.: Recurrent neural networks hardware
implementation on FPGA (2015). arXiv preprint arXiv:1511.05552
5. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8),
1735–1780 (1997)
6. Jiang, J., Hu, R., Luján, M., Dou, Y.: Accuracy evaluation of deep belief
networks with fixed-point arithmetic. Comput. Model. New Technol. 18(6), 7–14
(2014)
7. Li, S., Wu, C., Li, H., Li, B., Wang, Y., Qiu, Q.: FPGA acceleration
of recurrent neural network based language model. In: 2015 IEEE 23rd Annual
International Symposium on Field-Programmable Custom Computing Machines
(FCCM), pp. 111–118. IEEE (2015)
8. Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J., Khudanpur, S.: Recurrent
neural network based language model. In: Interspeech, vol. 2, p. 3 (2010)
9. Nurvitadhi, E., Sim, J., Sheffield, D., Mishra, A., Krishnan, S., Marr, D.: Acceler-
ating recurrent neural networks in analytics servers: comparison of FPGA, CPU,
GPU, and ASIC. In: 2016 26th International Conference on Field Programmable
Logic and Applications (FPL), pp. 1–4. EPFL (2016)
10. Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization
(2014). arXiv preprint arXiv:1409.2329
FPGA Implementation of a Short Read
Mapping Accelerator
1 Introduction
Despite all the progress and improvements, short read mapping is still a time-consuming
process on modern computers due to the massive amount of data that needs to be
processed. To solve this problem, many recent works try to accelerate short read
mapping on other platforms like FPGAs, due to the high parallelism and customization
they provide. Studies such as [8–13] accelerate short read mapping using FPGAs.
In this work an FPGA-based, fully pipelined accelerator for short read mapping is
proposed. The proposed hardware supports up to two mismatches in short reads with 75
base pairs (up to 100 bp). Our design uses the FM-index and the seed-and-compare
method. The main concepts of our design are:
• Pre-calculated data, along with using one memory controller for the top and bot pointers.
• Extracting three identical and non overlapping seeds from each short read in the
inexact match unit and comparing them with the reference.
• Through smart implementation, searching one of the three extracted seeds from
each short read is done in the exact match unit.
• A multi-core system is presented to maximize the efficiency of the design.
2 Related Works
In the following subsection we briefly discuss the FM-index approach and after that
review some recent short read mapping accelerators.
2.1 FM-index
To use the FM-index method, the Burrows-Wheeler transform (BWT) [14] has to be
generated from the reference genome (Fig. 1a). The suffix array (SA) values show the
position of each suffix in the original reference stream (Fig. 1b). Using BWT stream,
the occurrence array O(x,i) and the characters count C(x) are generated from the BWT
(Fig. 1c). Then, searching any short read in the reference genome is done using Eqs. 1
and 2 with n steps, where n is the length of the short read.
The search operation uses two pointers named top and bot (bottom). These pointers
need to be updated n times. To find the location of a short read in the reference
genome, the top pointer is used as the address to read the SA values (Fig. 1d is an example
of searching GA in ACTGA). It is very important to note that finding the SA values
using the top and bot values is not done in the FPGA accelerators; it is assumed that
this step is done in software. Also, to reduce the memory size required to store O(x,i),
the rows of O(x,i) are sampled with a factor of d, and the rest of the values (d−1 values)
are calculated online using the sampled values and the BWT.
Fig. 1. An example of generating the BWT, SA values, O(x,i) and C(x) from a reference
genome and finding GA in the reference.
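Since Eqs. 1 and 2 are not reproduced above, the sketch below shows one common formulation of the FM-index backward search that updates the top and bot pointers once per read character; the indexing convention follows widely used descriptions (e.g. in [5]) and may differ in small details from the paper's equations.

def build_fm_index(reference):
    """Build the BWT, suffix array, C(x) and occurrence table O(x, i)
    for a small reference string terminated with '$'."""
    text = reference + "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    C, total = {}, 0
    for ch in alphabet:
        C[ch] = total
        total += text.count(ch)
    # O[ch][i] = number of occurrences of ch in bwt[0..i-1]
    O = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            O[ch][i + 1] = O[ch][i] + (1 if b == ch else 0)
    return bwt, sa, C, O

def backward_search(read, C, O, n):
    """Update the top and bot pointers once per read character, processed from
    the last character to the first (one common FM-index convention)."""
    top, bot = 0, n  # n = len(reference + '$') - 1
    for ch in reversed(read):
        top = C[ch] + O[ch][top]          # analogue of Eq. 1
        bot = C[ch] + O[ch][bot + 1] - 1  # analogue of Eq. 2
        if top > bot:
            return None  # no exact match
    return top, bot

if __name__ == "__main__":
    bwt, sa, C, O = build_fm_index("ACTGA")
    rng = backward_search("GA", C, O, n=5)
    if rng:
        print("matches at reference positions:",
              [sa[i] for i in range(rng[0], rng[1] + 1)])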
3 Proposed Architecture
In this section the proposed architecture to implement short read mapping on FPGA is
discussed in details. The fully pipelined design consists of two main modules: the exact
match unit and the inexact match unit. Short reads enter the exact match unit and the
292 M. Morshedi and H. Noori
short reads that cannot be aligned in the exact match unit, are transferred to the inexact
match unit. The proposed inexact match unit does not use backtrack version of
FM-index. Instead, it extracts seeds from each short read and searches them in the
reference genome using simple FM-index. This section consists of four major
sub-sections: (1) Exact match unit architecture. (2) Pre-calculated values. (3) Inexact
match unit architecture. (4) Multi-core implementation of the design.
Fig. 2. Top view of exact match unit, inexact match unit and the quad-core design
Our design needs nine clock cycles to update a single top and bot (due to memory
latency, generating O(x,i) values, and adding O(x,i) to C(x)). To hide the nine clock
cycles of latency [10], we search nine short reads concurrently in the exact match unit.
While some short reads are waiting for new data (top_new and bot_new), other short reads
generate and send requests for their corresponding tops and bots. As a result,
searching any short read of length n can be done in n clock cycles on average.
Another important consideration is the memory interface. Basically, to implement
Eqs. 1 and 2, two connections to two separate memories are needed, one for top (Eq. 1)
and one for bot (Eq. 2), respectively. If only one memory is used for both top and bot,
two memory accesses are needed to read the O(x, top_old) (Eq. 1) and O(x, bot_old) (Eq. 2)
values (and their BWTs). Therefore, the required memory size is decreased by half,
but the delay for reading O(x,i) is doubled and the speedup decreases by half. In the
FM-index, top and bot have the maximum distance at the beginning. The distance
decreases in each search step. Hence, in many cases the top and bot will hit the same
sampled O(x,i), and the original O(x,i) can be calculated for both top and bot in one
clock cycle by reading one memory address.
Our design uses one memory for both top and bot instead of two to reduce the
number of memory interfaces. Through experiments with 10 K short reads we
learned that after the first seven steps (on average) the top and bot can be calculated using the
same sampled O(x,i), which requires one memory access (around 13 steps when the
reference is the whole human genome [12]). With this method (using one memory for
top and bot instead of two), the number of memory controllers is reduced to one for
each exact match unit. However, the speedup of each exact match unit with one
memory controller decreases by at most 0.15x for short reads of length 75
compared to the exact match unit with two memory controllers.
much more than its effect in the exact match unit in terms of speedup. For the seeds
with 25 base pairs, the speed-up while using pre-calculated data is 2x (m = 9).
Another enhancement used in our design is that the seed aligner needs to search
only two seeds instead of three. In our design, additional counters and registers
are added to the exact match unit so that when the search steps reach one third of
the short read, the related top and bot are stored in a register. These data are sent to the
inexact match unit. The seed aligner in the inexact match unit reuses these data and,
therefore, does not need to search one of the seeds again. Using this technique and
pre-calculated data, searching the seeds in the seed aligner module, which is always the
slowest module in the inexact match unit pipeline stages, becomes 3x faster.
Table 1. Area and BRAM usage and the run time for searching one million short reads.

                    LUT           Register      32 Kb BRAM   Run time (sec)
  Quad-core design  29554 (19%)   31091 (10%)   361 (87%)    0.095
According to the recent works discussed in Sect. 2, searching the smaller per-
centage of the short reads which contain mismatches is the most time-consuming part
of short read mapping. Using the optimization techniques proposed in our design,
searching the short reads with mismatches has become much faster than searching the
short reads in the exact match unit.
Table 2. Comparing software and FPGA run time for searching 100 thousand short reads.

                        Number of threads   Clock freq. (MHz)   Run time (sec)   Speed-up
  AMD FX9590            8                   4600                0.39             41
  Intel Core i7-5820K   12                  3300                0.18             19
5 Conclusion
multithreading, pipelining and using pre-calculated data. Our paper uses a modified
seed-and-compare version of the FM-index to align short reads of 75 bp (with up to two
mismatches), and does not use the backtrack version of the FM-index, which is more
complex.
References
1. Mardis, E.R.: The impact of next-generation sequencing technology on genetics. Trends
Genet. 24(3), 133–141 (2008)
2. Wetterstrand, K.: DNA sequencing costs, data from the NHGRI Genome Sequencing
Program (GSP) (2014). http://www.genome.gov/sequencingcosts
3. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol.
Biol. 147(1), 195–197 (1981)
4. Altschul, S.F., et al.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990)
5. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform.
Bioinformatics 25(14), 1754–1760 (2009)
6. Liu, C., et al.: SOAP3: ultra-fast GPU-based parallel alignment tool for short reads.
Bioinformatics 28(6), 878–879 (2012)
7. Ferragina, P., Manzini, G.: An experimental study of an opportunistic index. In: Proceedings
of the 12th ACM-SIAM Symposium on Discrete Algorithms, pp. 269–278 (2001)
8. Olson, C.B., et al.: Hardware acceleration of short read mapping. In: 2012 IEEE 20th Annual
International Symposium on Field-Programmable Custom Computing Machines (FCCM),
pp. 161–168. IEEE (2012)
9. Fernandez, E., Najjar, W., Lonardi, S.: String matching in hardware using FM-index. In:
2011 IEEE 19th Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM), pp. 218–225. IEEE (2011)
10. Arram, J., Tsoi, K.H., Luk, W., Jiang, P.: Reconfigurable acceleration of short read mapping.
In: 2013 IEEE 21st Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM), pp. 210–217. IEEE (2013)
11. Arram, J., Luk, W., Jiang, P.: Ramethy: reconfigurable acceleration of bisulfite sequence
alignment. In: Proceedings of the 2015 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, pp. 250–259. ACM (2015)
12. Arram, J., et al.: Leveraging FPGAs for accelerating short read alignment. IEEE/ACM
Trans. Comput. Biol. Bioinform. (2016). http://ieeexplore.ieee.org/document/7422003/
13. Xin, Y., et al.: Parallel architecture for DNA sequence inexact matching with
Burrows-Wheeler Transform. Microelectron. J. 44(8), 670–682 (2013)
14. Burrows, M., Wheeler, D.: A block-sorting lossless data compression algorithm. Digital
Equipment Corporation. Technical report (1994)
15. UCSC Genome Bioinformatics. http://hgdownload.cse.ucsc.edu
Languages and Estimation Techniques
dfesnippets: An Open-Source Library
for Dataflow Acceleration on FPGAs
Paul Grigoras(B) , Pavel Burovskiy, James Arram, Xinyu Niu, Kit Cheung,
Junyi Xie, and Wayne Luk
1 Introduction
Highly tuned FPGA implementations can achieve performance and power effi-
ciency gains for many problems [1]. However, development productivity is lim-
ited compared to other acceleration alternatives such as GPUs or Xeon Phi
processors [2].
Recently, higher-level programming facilities based on High Level Synthe-
sis [3,4] or domain specific languages [5–7] have improved productivity of
FPGA development significantly. High-quality standard development libraries
are becoming essential to improve productivity further. However, FPGA devel-
opment environments may not provide standard development libraries. Funda-
mental operations such as floating point reductions may not be supported and,
depending on the available resources and desired performance, are nontrivial to
implement, as we show in Sect. 2.
It is therefore necessary to provide well-designed component libraries to facil-
itate the development of applications and tools. However, in addition to these
facilities, and as a point of departure from conventional approaches, given the
performance-critical nature of the FPGA environment, component and applica-
tion benchmarks should also be part of the library to facilitate the development
of high-performance designs. To increase developer productivity for FPGA accel-
erators at all levels, libraries might provide: (1) library components which serve
as the building blocks for developing real-world applications. These library com-
ponents should be efficient in terms of latency, throughput and resource usage,
and provide a useful and customisable interface; (2) benchmarking utilities which
aid in tasks such as determining system performance and resource utilisation.
These utilities are essential for rapid prototyping, and assessing the scalability
and feasibility of FPGA designs; (3) applications which can be used as bench-
marks, or case studies for framework and tool development. These applications
can also be adapted to accelerate closely related problems, considerably reducing
development time.
In this work we present dfesnippets,1 the first community driven open-
source library for Maxeler DataFlow Engines (DFEs). The library is available
under the MIT License. Table 1 provides an overview of the components:
1 https://github.com/custom-computing-ic/dfe-snippets.
Although we will not cover this in greater detail due to lack of space,
the library also contains: (1) header-only C++ libraries implementing useful
functionality for managing and benchmarking DFE projects, ranging from timing
utilities to APIs for reordering sparse matrix data in preparation for FPGA exe-
cution; (2) tools for creating and managing projects, such as compiling, generating
and managing multiprocess and multi-node hardware compilation, and automati-
cally extracting and tabulating resource usage and generating reports; (3) a compre-
hensive, automated test suite, testing each design to ensure it is functionally correct
and that it meets timing and resource usage constraints.
A community driven library of open-source implementations, component and
application benchmarks can increase the productivity of researchers and profes-
sional programmers. It can also improve the quality of results, and pave the way
for broader FPGA adoption in areas where productivity has been a key limiting
factor, such as High Performance Computing.
2 Library Components
Library components are the building blocks for developing more complex real
world applications. dfesnippets includes a range of components such as generic
reduction, I/O blocks, linear algebra blocks (sparse product, matrix vector and
matrix-matrix-multiply, power iteration kernels, sparse matrix vector multipli-
cation for banded matrices), and generic configuration and connectivity utilities
such as inter-FPGA communication blocks. Despite being fundamental compo-
nents, they are challenging to implement on FPGAs due to the resource-con-
strained nature of the platform and the high emphasis on performance and resource efficiency.
To be used effectively in large scale designs, library components must be para-
metric, provide a useful interface, and be efficient in terms of latency, throughput,
and resource utilisation. Pure encapsulation, in software terminology, is difficult
to achieve, therefore the internals of many cores may have to be customised in
order to fit into the resource and performance constraints of a particular appli-
cation. This makes source code availability important for component reuse.
We implement dfesnippets for the Maxeler FPGA platform [21]. The plat-
form consists of a hardware implementation, a compiler from a high-level
dataflow language, MaxJ, to an FPGA bitstream, and a runtime environment.
MaxJ [22] provides explicit control of the design of the hardware architecture
itself, which is critical in delivering good performance and effectively exploit-
ing customisation opportunities available for FPGA designers. It is conceptu-
ally close to Verilog, but with increased productivity due to the abstraction
of low-level vendor IPs; MaxJ provides good support for software-only simula-
tion and interfacing with many available programming languages. These features
make MaxJ a good choice for implementing an open-source library: it provides
a high level of control and flexibility without being verbose while the similarity
to other hardware description languages simplifies porting components to other
languages. In the rest of this Section we provide a more in-depth look at certain
components of dfesnippets. For a full list please see the project page.2

2 https://github.com/custom-computing-ic/dfe-snippets.
2.1 Reductions
signal high regardless of their internal state, thus finalising the reduction with
whatever number of inputs are internally present in our PCBT circuit. This
enables accumulating an arbitrary number of terms in a reduction set.
[Figure: the PCBT reduction circuit, showing registers and per-stage state machines (SM) feeding chained adders, with enable, reset, and valid control signals on the input and output sides.]
To conclude the reduction case study, we note that, in principle, the reduc-
tion operation is probably one of the most fundamental building blocks required
for implementing more complex applications. However, due to the broad range
of design choices with varying throughput, resource utilisation and functionality,
this operation is not trivial to implement. Having easy access to multiple vari-
ants of reduction circuit, as provided by dfesnippets, can therefore improve
productivity.
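As a software analogue of such a reduction circuit, the sketch below models a fixed-width binary adder tree that zero-pads an arbitrary-length input, one simple way to mimic accumulating an arbitrary number of terms; it does not model the per-stage state machines of the PCBT design.

def tree_reduce(values, width=8):
    """Reduce an arbitrary number of values with a fixed-width binary adder tree.

    The hardware analogue processes `width` inputs per cycle (width should be a
    power of two); each partially filled chunk is zero-padded so the tree still
    produces a valid partial sum, mirroring the flush behaviour described above.
    """
    total = 0.0
    for start in range(0, len(values), width):
        chunk = list(values[start:start + width])
        chunk += [0.0] * (width - len(chunk))   # zero-pad the last, partial chunk
        # binary adder tree: log2(width) levels of pairwise additions
        while len(chunk) > 1:
            chunk = [chunk[i] + chunk[i + 1] for i in range(0, len(chunk), 2)]
        total += chunk[0]
    return total

if __name__ == "__main__":
    data = [0.5] * 13          # 13 terms: not a multiple of the tree width
    print(tree_reduce(data))   # 6.5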
I/O blocks are commonly used to manage the connection between the compu-
tational kernel implemented on the FPGA accelerator and off-chip components
such as DRAM, the host CPU (PCIe, Infiniband), or other FPGA devices.
In the case of DRAM and CPU communication, the I/O blocks may be
required to convert the fixed width output interface of the communication chan-
nel to a different input width of the computational kernel. This is a common
requirement, particularly for applications which process an irregular, runtime
dependent input size at each cycle such as a sparse matrix vector multiplication
kernel [8]. The I/O blocks are required to be efficient from a resource utilisation
perspective, but the logic they implement is often complex and its control-heavy
nature does not map well to dataflow-style accelerators and languages. If unop-
timised, these blocks can use substantial on-chip resources, particularly memory
resources such as BRAMs.
Blocks such as the Arbitrary Length Burst Proxy (ALBP) included in
dfesnippets and used in previous work [8] can help address these issues. The
ALBP architecture contains k FIFOs to store bursts retrieved from off-chip mem-
ory. Once a burst is retrieved, data are pushed into the FIFOs such that the i-th
element of a burst is assigned to FIFO (o + i) mod k, where o is the position after
processing the previous burst. m_t ≤ k data items may be simultaneously requested
from the ALBP by the compute kernel, where m_t is runtime-determined. If fewer
than k items are requested, the output is zero-padded to the fixed width k to match
the fixed, regular k-wide input interface of the compute kernel.
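A behavioural sketch of the ALBP idea is given below: bursts are distributed round-robin across k FIFOs starting from the previous offset, and a runtime-sized request is served in order and zero-padded to the fixed kernel width. The class name and burst format are illustrative assumptions, not the library's interface.

from collections import deque

class ALBPModel:
    """Software model of an Arbitrary Length Burst Proxy with k FIFOs."""

    def __init__(self, k):
        self.k = k
        self.fifos = [deque() for _ in range(k)]
        self.offset = 0       # write position after processing the previous burst
        self.read_offset = 0  # read position after serving the previous request

    def push_burst(self, burst):
        # The i-th element of a burst goes to FIFO (offset + i) mod k.
        for i, value in enumerate(burst):
            self.fifos[(self.offset + i) % self.k].append(value)
        self.offset = (self.offset + len(burst)) % self.k

    def request(self, m_t):
        """Return m_t items in original order (m_t <= k), zero-padded to width k."""
        out = []
        for j in range(m_t):
            fifo = self.fifos[(self.read_offset + j) % self.k]
            out.append(fifo.popleft() if fifo else 0)
        out += [0] * (self.k - m_t)
        self.read_offset = (self.read_offset + m_t) % self.k
        return out

if __name__ == "__main__":
    albp = ALBPModel(k=4)
    albp.push_burst([1, 2, 3, 4, 5, 6])  # a burst longer than k
    print(albp.request(3))  # [1, 2, 3, 0]: three items requested, padded to k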
3 Benchmarking
Benchmarking utilities are especially helpful for the research community. They
help establish a baseline for the system performance or resource efficiency, facil-
itate quick estimation and prototyping (for example to assess the scalability of
various designs with respect to memory bandwidth, resources etc.), provide san-
ity checks and highlight empirically the impact of some optimisations which may
not be entirely transparent to the end user. Two types of benchmarks are particu-
larly important for FPGA development: (1) performance benchmarks which can
be used to measure the throughput and latency of FPGA designs and memory
and interconnect subsystems (2) resource utilisation benchmarks which demon-
strate the resource efficiency of particular cores and are essential for assessing
the scalability and feasibility of FPGA designs.
Performance. dfesnippets provides three system level performance bench-
marks which can be used to measure the achievable throughput of various links.
The Default DRAM Benchmark instantiates a default memory controller, with
customisable clock frequency which reads and writes data in a linear access fash-
ion. This can be used to determine the peak memory bandwidth performance
of a given device, which can serve as a baseline for measuring the achieved
performance of user applications. The Custom DRAM Benchmark instantiates
a more complex design with a custom memory command generator and asso-
ciated host code to drive the benchmarking. This can be used for evaluating
the memory access speed using custom memory commands and linear access
patterns. It fetches parallel data streams from DRAM and then routes them to
DRAM and/or host, behaviour which is configurable by the user. The major con-
figuration options are parameterised so users can change the number of bursts
per command, size of memory to access, width of memory interface and number
of parallel DRAM streams to match existing properties in their own designs.
This enables rapid experimentation with application-specific data placement
and access scheduling techniques to improve DRAM performance. The InfiniBand/PCIe
DRAM Benchmark instantiates a simple pass-through design which
matches the PCIe input width (128 bits). Together with the associated software
to run on the CPU, the design can be used to measure throughput over the CPU
to FPGA interconnect.
The library allows users to easily adjust the number of measurements, data
size, memory controller frequency, on-chip frequency, and architecture for each
benchmark. This reduces the possibility for error and promotes good practices.
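On the host side, each of these measurements reduces to timing a known volume of data through the link under test. The sketch below shows that general pattern; run_benchmark is a hypothetical stand-in for launching one of the benchmark designs, not a dfesnippets API call.

#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for launching a benchmark design and streaming
 * `bytes` of data through the link under test (DRAM, PCIe, ...). */
extern void run_benchmark(size_t bytes);

/* Achieved bandwidth in GB/s: bytes transferred divided by wall-clock time. */
static double measured_bandwidth_gbs(size_t bytes, int repetitions) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < repetitions; r++)
        run_benchmark(bytes);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return ((double)bytes * repetitions) / seconds / 1e9;
}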
Resource Utilisation. dfesnippets includes a synthetic resource utilisation
benchmark to measure the resource usage of various blocks using the MaxCompiler
built-in resource usage annotations. These reports are openly available as
part of the library and can provide the basis for rapid resource usage estimation
models without the need to sit through long compilations. This can greatly
reduce the time needed to prototype designs. The benchmarks are provided for both
the Xilinx Virtex 6 based Vectis boards and the Stratix V based Maia boards. This
provides a quick way to highlight differences between the two (such as the different
resource usage profiles of DSPs) or to provide insight into hidden properties
which could otherwise only be discovered through significant empirical exploration, such
as the considerable resource savings achieved by reducing pipelining factors on
the Stratix V Maia boards.
4 Applications
dfesnippets also includes a set of full applications which can be used as reusable
components in other applications or as benchmarks and case studies for frame-
work and tool development. The broader availability of such applications can
help researchers and developers focus more on their area of expertise and avoid
typical pitfalls stemming from the complexity of designing FPGA based appli-
cations. These applications themselves contain reusable blocks which can be
adapted in other designs, or can be reused directly in other applications, per-
haps as one stage of a complex pipeline or multi FPGA design. Overall, the
availability of these larger designs can increase the productivity of researchers
and tool developers.
The QPI phase retrieval and cell image classification design is composed
of a spatial domain module, a frequency domain module and a linear SVM
classifier. The spatial domain module performs background subtraction, intensity
normalization and complex phase shift extraction. The frequency domain module
performs low-pass filtering to reduce noise and retrieves the final phase images. The
Winograd 16-point algorithm is used in the frequency domain module to perform
the forward and inverse 2D fast Fourier transforms (FFTs). The sequential Winograd
algorithm has low resource consumption and is suitable for a wide range of
applications involving frequency spectrum analysis.
The QPI application has a throughput of 32.08 GOPS when running on a
single Altera Stratix V GS 5SGSD8 FPGA [19], which is equivalent to retriev-
ing and classifying around 2497 phase images of 256 × 256 size. Classification
accuracy of unstained and live human chondrocytes (OAC), human osteoblasts
(OST) and mouse fibroblasts (3T3) increases when using retrieved phase images.
5 Evaluation
dfesnippets totals approximately 6000 lines of CPU utilities and tests, 7000
lines of MaxJ in the library and benchmarking components, and 4000 lines of CPU
and MaxJ code in the application components. We estimate the development
time of each library component to be on the order of one to two weeks, while
the development effort for applications is on the order of one to two months. Both
library and application development usually involve two developers, one of whom
is typically experienced (more than two years) in the MaxJ programming
language.
Even in a relatively high level language such as MaxJ, approximately 600
lines of library code including comments are required to implement the three
alternative reduction strategies described in Sect. 2.1 plus an additional 700
lines for setting up the CPU test bench that is vital to verify the correctness
of these implementations, particularly for the more complex designs. By using
dfesnippets almost 1300 lines of code can be replaced by several lines to instan-
tiate the required reduction circuits directly in the user design. Therefore the
productivity gains resulting from the proposed library component of our app-
roach are substantial, particularly since reduction circuits are generic blocks,
commonly used in many applications. Table 2 shows several applications where
we have used dfesnippets and observed a substantial reduction in source lines
of code (SLOC) for the hardware design.
To illustrate the productivity gains achievable by the application components,
we note that recent software frameworks such as experimental compilers [12]
and resource management frameworks [13] for FPGA-based systems can
utilise these applications directly as benchmarks. Prototypes for these projects
require 4123 and 3880 lines of code respectively, while the benchmarks require
2050 and 2924 lines of code respectively. Therefore a substantial productivity
gain comes from the ability to directly reuse these benchmarks and avoid spend-
ing substantial time on redeveloping complex designs. We estimate the development
time of application components in dfesnippets to be one to two months
each for an experienced MaxJ developer. These applications often require com-
plex, specialised and state of the art blocks such as high throughput random
number generators, Fast-Fourier Transforms, and custom memory controllers.
Such blocks are not only complex and non-trivial to optimise for FPGA imple-
mentation, they are also difficult to develop and debug. It is clear that from a
tool developer perspective, it is not productive to spend as much time developing
the benchmark as developing the tool itself.
Not only is the development time reduced substantially by avoiding the need
to redevelop benchmarks, but the parametric design supports customisation
effectively, leading to additional productivity gains. All applications can be built
with minimal configurations to verify correctness or with full replication and
optimisations to verify performance and energy efficiency. This approach simpli-
fies debugging and testing in the early stages of project development by reducing
the compilation time.
6 Conclusion
References
1. Todman, T.J., Constantinides, G.A., Wilton, S.J., Mencer, O., Luk, W.,
Cheung, P.Y.: Reconfigurable computing: architectures and design methods. IEE
Proc.-Comput. Digit. Tech. 152(2), 193–207 (2005)
2. Jones, D.H., Powell, A., Bouganis, C., Cheung, P.Y.: GPU versus FPGA for high
productivity computing. In: Proceedings of the FPL, pp. 119–124 (2010)
3. Zhang, Z., Fan, Y., Jiang, W., Han, G., Yang, C., Cong, J.: AutoPilot: a platform-
based ESL synthesis system. In: Coussy, P., Morawiec, A. (eds.) High-Level Syn-
thesis, pp. 99–112. Springer, Heidelberg (2008)
4. Canis, A., Choi, J., Aldham, M., Zhang, V., Kammoona, A., Anderson, J.H.,
Brown, S., Czajkowski, T.: LegUp: high-level synthesis for FPGA-based proces-
sor/accelerator systems. In: Proceedings of the FPGA, pp. 33–36. ACM (2011)
5. Kulkarni, C., Brebner, G., Schelle, G.: Mapping a domain specific language to a
platform FPGA. In: Proceedings DAC, pp. 924–927. ACM (2004)
6. George, N., Lee, H., Novo, D., Rompf, T., Brown, K.J., Sujeeth, A.K.,
Odersky, M., Olukotun, K., Ienne, P.: Hardware system synthesis from domain-
specific languages. In: Proceedings of the FPL, pp. 1–8. IEEE (2014)
7. Cong, J., Sarkar, V., Reinman, G., Bui, A.: Customizable domain-specific comput-
ing. IEEE Des. Test Comput. 28(2), 6–15 (2011)
8. Grigoras, P., Burovskiy, P., Luk, W.: CASK: open-source custom architectures for
sparse kernels. In: Proceedings of the FPGA, pp. 179–184 (2016)
9. Grigoras, P., Burovskiy, P., Hung, E., Luk, W.: Accelerating SpMV on FPGAs by
compressing nonzero values. In: Proceedings of the FCCM (2015)
10. Chow, G., Grigoras, P., Burovskiy, P., Luk, W.: An efficient sparse conjugate gra-
dient solver using a benes permutation network. In: Proceedings of the FPL (2014)
11. Burovskiy, P., Grigoras, P., Sherwin, S.J., Luk, W.: Efficient assembly for high
order unstructured FEM meshes. In: Proceedings of the FPL (2015)
12. Grigoras, P., Niu, X., Coutinho, J., Luk, W., Bower, J., Pell, O.: Aspect driven
compilation for dataflow designs. In: Proceedings of the ASAP (2013)
13. Grigoras, P., Tottenham, M., Niu, X., Coutinho, J.G.F., Luk, W.: Elastic man-
agement of reconfigurable accelerators. In: Proceedings of the ISPA, pp. 174–181.
IEEE (2014)
14. Coutinho, J.G.F., Pell, O., O’Neill, E., Sanders, P., McGlone, J., Grigoras, P.,
Luk, W., Ragusa, C.: HARNESS project: managing heterogeneous computing
resources for a cloud platform. In: Goehringer, D., Santambrogio, M.D., Cardoso,
J.M.P., Bertels, K. (eds.) ARC 2014. LNCS, vol. 8405, pp. 324–329. Springer, Hei-
delberg (2014). doi:10.1007/978-3-319-05960-0_36
15. Arram, J., Pflanzer, M., Kaplan, T., Luk, W.: FPGA acceleration of reference-
based compression for genomic data. In: Proceedings of the ICFPT, pp. 9–16.
IEEE (2015)
16. Arram, J., Luk, W., Jiang, P.: Ramethy: reconfigurable acceleration of bisulfite
sequence alignment. In: Proceedings of the FPGA, pp. 250–259. ACM (2015)
17. Burovskiy, P., Girdlestone, S., Davies, C., Sherwin, S., Luk, W.: Dataflow acceler-
ation of Krylov subspace sparse banded problems. In: Proceedings of the FPL, pp.
1–6. IEEE (2014)
18. Grigoras, P., Burovskiy, P., Luk, W., Sherwin, S.: Optimising sparse matrix vector
multiplication for large scale FEM problems on FPGA. In: Proceedings of the FPL,
pp. 1–9. EPFL (2016)
19. Xie, J., Niu, X., Lau, A.K., Tsia, K.K., So, H.K.: Accelerated cell imaging and
classification on FPGAs for quantitative-phase asymmetric-detection time-stretch
optical microscopy. In: Proceedings of the ICFPT, pp. 1–8. IEEE (2015)
20. Arram, J., Tsoi, K.H., Luk, W., Jiang, P.: Hardware acceleration of genetic
sequence alignment. In: Brisk, P., Figueiredo Coutinho, J.G., Diniz, P.C. (eds.)
ARC 2013. LNCS, vol. 7806, pp. 13–24. Springer, Heidelberg (2013). doi:10.1007/
978-3-642-36812-7_2
21. Lindtjørn, O., Clapp, R.G., Pell, O., Mencer, O., Flynn, M.J.: Surviving the end of
scaling of traditional micro processors in HPC. In: IEEE HOT CHIPS 22 (2010)
22. Pell, O., Mencer, O.: Surviving the end of frequency scaling with reconfigurable
dataflow computing. SIGARCH Comput. Archit. News 39(4), 60–65 (2011)
23. Morris, G.R., Zhuo, L., Prasanna, V.K.: High-performance FPGA-based general
reduction methods. In: Proceedings of the FCCM, pp. 323–324 (2005)
24. Zhuo, L., Morris, G.R., Prasanna, V.K.: Designing scalable FPGA-based reduction
circuits using pipelined floating-point cores. In: Proceedings of the ISPDP (2005)
25. Wilson, D., Stitt, G.: The unified accumulator architecture: a configurable,
portable, and extensible floating-point accumulator. Trans. Reconfigurable Tech-
nol. Syst. (TRETS) 9(3), 21 (2016)
26. Zhuo, L., Morris, G.R., Prasanna, V.K.: High-performance reduction circuits using
deeply pipelined operators on FPGAs. IEEE Trans. PDS 18(10), 1377–1392 (2007)
27. Ferragina, P., Manzini, G.: An experimental study of an opportunistic index. In:
Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algo-
rithms, pp. 269–278. Society for Industrial and Applied Mathematics (2001)
28. Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat.
Methods 9(4), 357–359 (2012)
29. Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using com-
pressed data structures. Genome Res. 22(3), 549–556 (2012)
30. Zhang, Y., Li, L., Yang, Y., Yang, X., He, S., Zhu, Z.: Light-weight reference-based
compression of FASTQ data. BMC Bioinform. 16(1), 1 (2015)
31. Burrows, M., Wheeler, D.J.: A Block-sorting Lossless Data Compression Algorithm
(1994)
32. Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches.
SIAM J. Comput. 22(5), 935–948 (1993)
33. Mitchell, A.R., Griffiths, D.F.: The Finite Difference Method in Partial Differential
Equations. Wiley, Hoboken (1980)
34. Thomas, D.B., Luk, W.: High quality uniform random number generation using
LUT optimised state-transition matrices. VLSI Sig. Process. 47(1), 77–92 (2007)
A Machine Learning Methodology for Cache
Recommendation
1 Introduction
The cache memory is a critical component of modern processors because it hides
the latency of accessing main memory. How well it does so depends on the memory
demands of the software running on the system: an application might benefit greatly
from a cache configuration that is highly inefficient for another application. Moreover,
the cache can account for a significant percentage of the system's power consumption [13].
Therefore, choosing the right cache system, one that fulfills the application's memory
requirements with minimal resources, would not only improve performance but
would also provide energy savings. This is particularly important for
embedded systems, which usually have tight area and energy constraints while
also demanding high performance.
Furthermore, choosing the optimal cache for a system is not a trivial task.
Normally, the architecture designer chooses a cache configuration that yields
average performance and energy consumption over a series of programs,
which leads to non-optimal cache behaviour for any specific application.
Configuring a cache parameter usually involves a trade-off between performance,
cost and energy consumption. For example, a small block
size enables faster data transfers from main memory to the cache than a
larger block size. On the other hand, a large block size favors the spatial
locality of the data, as data with consecutive addresses are usually accessed
sequentially, thus requiring fewer transfers. However, if the block size
is too large, unnecessary data will be transferred, which may also decrease
performance.
Furthermore, design space exploration involves evaluating the effect that
the different cache parameters (e.g. cache size, associativity, line size, replacement
policy, among others) have on performance. This is usually done using
simulators such as Gem5 [3], SimpleScalar [4], SMPCache [16], Dinero IV [5],
etc., and even data mining techniques [11].
Statistical classification is a machine learning technique which, given a set
of observations, aims to identify to which of a set of predefined categories or
classes each observation belongs. Examples of applications of this
technique are classifying emails as spam or not spam, classifying texts
into literary genres, classifying the symptoms of medical patients into possible
diseases, and so on. Cache design can likewise be cast as a statistical classification
problem: one wants to determine which cache configuration (or class) is optimal
for every application (or observation). Thus, statistical classification techniques
can be an effective way to narrow down the design space for cache design.
In this paper, we propose a methodology for predicting the optimal cache
configuration for a program given as input. The methodology uses classification,
a supervised machine learning technique. Our methodology starts by obtaining
the execution trace of the input program, then generates a feature vector
containing the frequencies of a subset of the dynamic instructions, and finally
feeds this vector to a series of previously trained classification models. The
models take the feature vector as input and output a cache configuration
predicted to be optimal for this program with respect to energy
consumption and performance.
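As a rough illustration of the feature-extraction step, the C sketch below counts instruction bigram (2-gram) frequencies over a dynamic trace. The opcode alphabet, the example trace and the function names are purely illustrative assumptions; the actual methodology derives its N-grams from Gem5 execution traces over a selected subset of dynamic instructions.

#include <stdio.h>
#include <string.h>

#define NOPC 4   /* illustrative opcode alphabet */
static const char *opcodes[NOPC] = { "ld", "st", "add", "mul" };

static int opcode_index(const char *op) {
    for (int i = 0; i < NOPC; i++)
        if (strcmp(op, opcodes[i]) == 0) return i;
    return -1;  /* opcodes outside the selected subset are ignored */
}

/* Build a bigram frequency matrix from a dynamic instruction trace. */
static void bigram_features(const char **trace, int len, int freq[NOPC][NOPC]) {
    memset(freq, 0, sizeof(int) * NOPC * NOPC);
    for (int i = 0; i + 1 < len; i++) {
        int a = opcode_index(trace[i]);
        int b = opcode_index(trace[i + 1]);
        if (a >= 0 && b >= 0)
            freq[a][b]++;
    }
}

int main(void) {
    const char *trace[] = { "ld", "add", "ld", "add", "mul", "st" };
    int freq[NOPC][NOPC];
    bigram_features(trace, 6, freq);
    printf("ld->add occurs %d times\n", freq[0][2]);  /* prints 2 */
    return 0;
}

The resulting frequency matrix, flattened into a vector, plays the role of the feature vector that is handed to the trained classifiers.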
We evaluated our methodology with 488 applications using different input
data. Our results show that our methodology reaches approximately a 99.8%
precision rate. Furthermore, a deeper analysis of our results indicates that in 75%
of the cases the misclassified programs were still assigned a suboptimal cache
configuration that increased the energy consumption by at most 10% in comparison
with the optimal cache configuration. These results suggest that our
methodology is a promising technique to narrow the design space exploration
when choosing the right cache memory for a set of applications.
2 Related Work
There have been several approaches that use data mining techniques for com-
puter architecture design. For example, there are approaches which focus on
improving branch prediction [10,18] or dynamic resource allocation [7].
Regarding cache memories, there have been very few data mining approaches
to recommend parameters of the cache system. CHIDDAM [6] is a methodology
based on a decision tree combined with a greedy algorithm to determine the best
cache hierarchy, i.e. number of cache levels, and the size of each cache level. This
methodology simulates a number of applications; then an algorithm scans the
simulation data iteratively and determines the best number of cache levels based
on the performance of the system. Moreover, a decision tree is built from the sim-
ulation data, which determines the contribution of each cache level to the over-
all performance. Finally, a greedy algorithm is used to find the best cache size
of each level. Unfortunately, this approach was evaluated with only two applica-
tions, so it remains uncertain if the methodology works for other applications.
[11] proposed a methodology to predict the block size of a cache for data min-
ing applications. The methodology uses the tool Pin [12] to obtain memory traces
of the applications. These memory traces were divided into blocks of 10 million
traces each and then fed to SMPCache [16], a cache simulator. The miss rates
of each block were obtained and used to measure the performance of the chosen
cache configuration. Then a feature dataset was created using the frequencies of
co-occurrences of memory traces as well as the frequencies of the traces with con-
tiguous memory addresses. The feature dataset was then used to train a neural
network. This methodology was also evaluated with very few applications, in this
case three; therefore it is not clear how effective it would be for any other data min-
ing application. In this paper, we focused on the first cache level of an embedded
processor, and considered 3 parameters in our methodology: cache size, line size
and associativity. Instead of focusing only on cache misses to measure the quality
of our methodology, we use a model that considers a Pareto optimal point between
energy consumption and cache hits. Furthermore, to provide a good evaluation,
we used a much larger dataset, consisting of 488 applications.
3 Proposed Methodology
In this work, we present a methodology based on machine learning to determine
the near-optimal cache configuration for a given application, without the need
to profile its execution on actual hardware. Figure 1 shows an overview of the
proposed methodology, in three stages: (a) database generation; (b) classifier
training; (c) test and refinement; each stage is explained below.
The methodology starts with the profiling of several benchmarks on a standard
processing architecture to generate the Profiling Database. The profiling goals
were: (a) hit and miss rates, (b) the application's energy consumption, and (c) total
execution time. Each benchmark application was run with different cache configurations
to determine how sensitive the profiling goals are to each configured
parameter and to determine the optimal cache configuration. For this paper, we
focused on the level-1 data cache and considered the following parameters: cache
size, line size and associativity. Table 1 shows the parameter values considered.
E_Cache = E_s + E_d    (1)
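The definitions of the terms in Eq. 1 are not reproduced around the equation, but a common reading, consistent with the later use of CACTI, is that the cache energy is the sum of a static (leakage) part and a dynamic (access) part. The C sketch below encodes only that assumed interpretation; the per-access energies and the leakage power are placeholder parameters, not values from the paper.

/* Assumed interpretation of Eq. 1: E_Cache = E_s (static/leakage) + E_d
 * (dynamic/access), with per-access energies taken e.g. from CACTI. */
typedef struct {
    double e_hit_nj;    /* energy per cache hit  [nJ] (placeholder) */
    double e_miss_nj;   /* energy per cache miss [nJ] (placeholder) */
    double p_leak_mw;   /* leakage power [mW]         (placeholder) */
} cache_energy_params_t;

static double cache_energy_nj(const cache_energy_params_t *p,
                              unsigned long hits, unsigned long misses,
                              double exec_time_s) {
    double e_dynamic = hits * p->e_hit_nj + misses * p->e_miss_nj;
    double e_static  = p->p_leak_mw * 1e-3 * exec_time_s * 1e9;  /* mW -> W, J -> nJ */
    return e_static + e_dynamic;
}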
Fig. 2. Profile database generated from the N-gram statistics and energy profiles.
Application | Hits | Misses | Energy (nJ) | Exec. time (s) | N-gram | Frequency
App 1       | 4387 | 1244   | 352         | 34             | Ng 1   | 132
            |      |        |             |                | Ng 2   | 75
            |      |        |             |                | ...    | ...
            |      |        |             |                | Ng n   | 21
...         | ...  | ...    | ...         | ...            | ...    | ...
App z       | 1063 | 957    | 43          | 210            | Ng 1   | 7
            |      |        |             |                | ...    | ...
            |      |        |             |                | Ng m   | 379
Fig. 3. Selecting the best classification function generated by the learned classifiers.
After training, the execution tests are used to determine the
quality of the prediction functions created by the classifiers. For a new
application, we extract the N-gram frequencies and check whether or not each
classifier generates the desired near-optimal solution (Fig. 4).
4 Experimental Setup
We used Gem5 [3] to obtain the dynamic instructions used to generate the features
to train our model and also to obtain the optimal cache configurations. Gem5
is a platform for architecture research which enables cycle-accurate processor
simulation with several different cache parameters. Table 3 shows the configuration
used for Gem5.
We used CACTI 4.1 [15] to obtain several values employed by the energy
model to calculate the power consumption of a cache. CACTI is a cache model
which provides estimates of access time, cycle time, area, etc. We used
the API of Weka 3.8 [9] to train and evaluate our model. Weka is a data
Fig. 4. Using the best classification function to recommend a cache configuration for
a given unknown application.
Parameter                | Value
System clock             | 1 GHz
Memory                   | Mode: timing accesses, address range: 4096 MB
Cache memory bus         | Coherent XBar
System memory bus        | Coherent XBar
Interrupt controller     | Directly connected to the bus and not cached
DDR3 memory controller   | DDR 1600 x64
mining tool set developed in Java that includes several machine learning algorithms,
filters, and evaluation tools. Table 4 shows the classifiers used to evaluate
our approach. Each experiment was run with all the classifiers. The execution
time of the classifiers, including the training and evaluation phases, was about 6 min
on average. To evaluate the classifiers, we used the measures precision, recall and
F-measure, which are commonly used in statistical classification. These measures
are calculated as shown in Eqs. 5, 6 and 7, where tp represents the number of
true positives, fp the number of false positives and fn the number of false negatives.
Precision = tp / (tp + fp)    (5)

Recall = tp / (tp + fn)    (6)

F-measure = 2 * (Precision * Recall) / (Precision + Recall)    (7)
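Eqs. 5-7 translate directly into code; the small C helper below is a plain transcription with illustrative names and assumes the denominators are non-zero.

typedef struct { double precision, recall, f_measure; } clf_metrics_t;

/* Eqs. 5-7: precision, recall and F-measure from true positives (tp),
 * false positives (fp) and false negatives (fn). */
static clf_metrics_t classification_metrics(unsigned tp, unsigned fp, unsigned fn) {
    clf_metrics_t m;
    m.precision = (double)tp / (tp + fp);
    m.recall    = (double)tp / (tp + fn);
    m.f_measure = 2.0 * (m.precision * m.recall) / (m.precision + m.recall);
    return m;
}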
Type      | Classifier
Bayes     | BayesNet, naive Bayes, naive Bayes multinomial
Functions | Multilayer perceptron, simple logistic, SMO
Lazy      | IBk, LWL, KStar
Meta      | AdaBoostM1, attribute selected classifier, bagging, CV parameter selection, filtered classifier, iterative classifier optimizer, LogitBoost, multiclass classifier, multischeme, random committee, random subspace, randomizable filtered classifier, stacking, vote
Misc      | Input mapped classifier
Rules     | Decision table, JRip, PART, OneR, ZeroR
Trees     | Decision stump, Hoeffding tree, J48, LMT, random forest, random tree, REPTree
Regarding the applications used to train and evaluate our model, we built
a data set of 488 programs from the MiBench [8] benchmark suite and from a
group of C programs provided on the Florida State University website [1]. The
applications' domains range from arithmetic and route planning to image
processing. We built a script to automatically compile and simulate each
program with Gem5 using each cache configuration from Table 1.
performing classifiers for this data set. The best classifier was RandomSubSpace
which obtained an F-Measure of 0.683 (an F-Measure closer to 1 is better).
10% in comparison with the optimal cache configuration. 12% of the misclassified
applications would generate an increase in energy consumption from 11% to 25%,
which suggests that even though the model misclassified these instances, it chose
cache configurations which are close in energy efficiency to the optimal cache
configurations.
that for 85% of the misclassified programs the energy consumption increased by
at most 10% in comparison with the optimal cache configuration, and for 12%
of the misclassified programs this growth was from 11% to 25%. These results
are promising and open several paths of research
to improve the precision of our methodology and further narrow down the design
space for cache memories.
As future work, we will carry out a deep analysis of the code of the misclas-
sified instances to obtain specific information about the reasons for the misclas-
sification and generate new features from this information to include into our
model. We will also add more applications from different domains to our data
set to train a more robust model and have a more accurate evaluation.
References
1. C source codes benchmark. http://people.sc.fsu.edu/~jburkardt/c_src/c_src.html
2. Weka’s resample filter. http://weka.sourceforge.net/doc.dev/weka/filters/super
vised/instance/Resample.html
3. Binkert, N., Beckmann, B., Black, G., Reinhardt, S.K., Saidi, A., Basu, A.,
Hestness, J., Hower, D.R., Krishna, T., Sardashti, S., et al.: The gem5 simula-
tor. ACM SIGARCH Comput. Archit. News 39(2), 1–7 (2011)
4. Burger, D., Austin, T.M.: The simplescalar tool set, version 2.0. ACM SIGARCH
Comput. Archit. News 25(3), 13–25 (1997)
5. Dinero IV trace-driven uniprocessor cache simulator (2012). http://www.cs.wisc.edu/markhill/DineroIV
6. Elakkumanan, P., Liu, L., Vankadara, V.K., Sridhar, R.: CHIDDAM: a data mining
based technique for cache hierarchy determination in commercial applications. In:
48th Midwest Symposium on Circuits and Systems, pp. 1888–1891. IEEE (2005)
7. Gomez, F.J., Burger, D., Miikkulainen, R.: A neuro-evolution method for dynamic
resource allocation on a chip multiprocessor. In: Proceedings of the International
Joint Conference on Neural Networks, IJCNN 2001, vol. 4, pp. 2355–2360. IEEE
(2001)
8. Guthaus, M., Ringenberg, J., Ernst, D., Austin, T., Mudge, T., Brown, R.:
MiBench: a free, commercially representative embedded benchmark suite. In: Pro-
ceedings of the Fourth Annual IEEE International Workshop on Workload Char-
acterization, pp. 3–14. WWC-4 (Cat. No. 01EX538) (2001)
9. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The
weka data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18
(2009)
10. Jiménez, D.A., Lin, C.: Neural methods for dynamic branch prediction. ACM
Trans. Comput. Syst. (TOCS) 20(4), 369–397 (2002)
11. Khakhaeng, S., Chantrapornchai, C.: On the finding proper cache prediction model
using neural network. In: 2016 8th International Conference on Knowledge and
Smart Technology (KST), pp. 146–151. IEEE (2016)
12. Luk, C.K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S.,
Reddi, V.J., Hazelwood, K.: Pin: building customized program analysis tools with
dynamic instrumentation. In: ACM Sigplan Notices, vol. 40, pp. 190–200. ACM
(2005)
13. Mittal, S.: A survey of architectural techniques for improving cache power effi-
ciency. Sustain. Comput.: Inform. Syst. 4(1), 33–43 (2014)
14. Navarro, O., Leiding, T., Hübner, M.: Configurable cache tuning with a victim
cache. In: 2015 10th International Symposium on Reconfigurable Communication-
centric Systems-on-Chip (ReCoSoC), pp. 1–6. IEEE (2015)
15. Tarjan, D., Thoziyoor, S., Jouppi, N.P.: CACTI 4.0. Technical report HPL-2006-86, HP Laboratories Palo Alto (2006)
16. Vega, M.A., Martín, R., Zarallo, F.A., Sánchez, J.M., Gómez, J.A.: SMPCache: simulador de sistemas de memoria cache en multiprocesadores simétricos. In: XI Jornadas de Paralelismo, Granada (2000)
17. Wang, W., Mishra, P., Gordon-Ross, A.: Dynamic cache reconfiguration for soft real-time systems. ACM Trans. Embed. Comput. Syst. (1) (2012). http://dl.acm.org/citation.cfm?id=2220340
18. Wang, Y., Chen, L.: Dynamic Branch Prediction Using Machine Learning. ECS-
201A, Fall (2015)
19. Zhang, C., Vahid, F., Najjar, W.: A highly configurable cache architecture for
embedded systems. In: 2003 Proceedings of the 30th Annual International Sympo-
sium on Computer Architecture, pp. 136–146. IEEE (2003)
ArPALib: A Big Number Arithmetic Library
for Hardware and Software Implementations.
A Case Study for the Miller-Rabin
Primality Test
1 Motivation
Big numbers, i.e. integers in computer data representation that comprise
hundreds of bits, are a foundation of the security solutions in today's
computer systems. Although some modern programming languages allow
programmers to choose an arbitrary variable size, the majority of modern C
compilers support a maximum integer size of only 64 bits. Therefore,
it is necessary to define custom big-number data types and to create functions for
arithmetic operations from scratch. Alternatively, a developer can rely on
one of the ready-to-use big number libraries that are offered either commercially
or as open source [1,2].
FPGAs have been used for security enhancement algorithms before [3–6].
Meanwhile, the role of Programmable SoC (PSoC) solutions, which incorporate
a CPU subsystem and an FPGA fabric in a single chip, is growing rapidly.
Soon, such CPU-FPGA hybrid solutions will become ubiquitous, not
only for embedded systems but in server solutions as well. The major obstacle
in deploying such systems is the cost of hardware development. Designing
hardware is time-consuming, cost-intensive, and requires extra developer skills.
However, a shift towards high-level programming can be observed in today's design
tools. Thanks to High-Level Synthesis (HLS), C and C++ implementations of algorithms
can be translated to their Register Transfer Level (RTL) representations, and the
process can be controlled by inserting pragmas into the source files.
HLS tools significantly speed up FPGA system development and pave
the way for the spread of CPU-FPGA systems. Unfortunately, some software
techniques (e.g. dynamic allocation or recursion) are prohibited by HLS. Thus,
preparing hardware-synthesizable C code still requires some effort and care.
The main goal of this paper is to introduce the Arbitrary Precision Arithmetic
Library (ArPALib) that was developed by the authors. Its main advantage over
other available arithmetic libraries is that it can be implemented both in software
and HLS synthesized hardware, allowing the developer to swiftly create CPU-
FPGA based solutions. It is worth highlighting that the source code of ArPALib
is available online in the repository provided by the authors [7].
To provide the proper background, we first give an overview of three big number
libraries available to C programmers: GMP [1], BigDigits [2], and ap_cint.h,
the built-in library of Xilinx's Vivado HLS.
The GMP library is considered one of the fastest big number libraries
available today. It supports arbitrary bit-widths for signed and unsigned integers
and fixed-point numbers. Its performance comes from a variety of
implemented algorithms that are selected according to the actual size of the
operands. Additionally, GMP's algorithms exploit aggressive optimizations
for selected processor architectures (e.g. AMD K5/K10, Intel Sandy Bridge,
the ARM family).
An individual number is represented by the mpz_t structure, which comprises
a pointer to a dynamically allocated array that stores the value of the
number, together with its current size. This representation reduces memory
read/write operations and relies on basic pointer arithmetic to perform calculations.
Unfortunately, these coding techniques exclude GMP from
an HLS design flow.
The BigDigits library is an open source arithmetic library that conforms to
the ANSI C standard. The authors of BigDigits implemented mainly paper-and-pencil
arithmetic algorithms, in which the arguments must have the same length. If
the allocated space is larger than the actual number length, a zero-padding operation
is performed to ensure the correctness of the result. Its simplicity makes
BigDigits a good candidate for HLS; however, some of its algorithms use
recursion, which is not supported by HLS tools.
In our experiments we used Xilinx's Vivado HLS environment, which provides
a built-in arbitrary precision integer library included in the ap_cint.h header.
3 ArPALib Introduction
To overcome the problems mentioned in Sect. 2, we created ArPALib, a fully
synthesizable (by Vivado HLS 2015.4) and C99-compatible library for software
and hardware implementations. Our goal was to propose a solution that
enables sequential processing of big numbers in blocks of a selected bit-width,
so as to reduce the resource footprint. The code of ArPALib is publicly available
under the GNU GPLv3 license [7].
The library can be parametrized to redefine the base integer type (named
uint_t), which is used as the elementary computational block of a big number
and is processed in sequential algorithm steps. The base type allows programmers
to choose a bit-width for the co-processor architecture that fits the size of the
selected FPGA device. Furthermore, it can be defined as an [u]int type for the Vivado
HLS compiler, thus enabling optimization for speed or resource footprint.
The type for unsigned big integers in ArPALib is called uintBig_t.
It contains an array of uint_t elements whose length is defined at
compile time. Optionally, uintBig_t can hold a variable that keeps the current
size of the stored big number. Thanks to this, the number of read/write operations
is reduced by excluding unused segments from computations instead of
performing a zero-padding operation. For example, the number of memory accesses
in the add operation is limited by the size of the smaller argument. The bigger the
difference between the arguments' lengths, the more significant the speed-up. Our approach
requires tracking the arguments' sizes, so it introduces some extra operations.
However, even if the arguments are of similar size, the overhead ArPALib
introduces is small (e.g. only one additional comparison for addition
compared with the unmodified algorithm). The library also supports dynamic data
allocation for software implementations to prevent stack overflow problems.
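A minimal C sketch of this data layout and of the length-aware addition described above is given below. The field and function names follow the description rather than the actual ArPALib sources (which are available in the repository [7]); the carry handling assumes a 32-bit base type, and overflow beyond the compile-time array length is simply ignored.

#include <stdint.h>

typedef uint32_t uint_t;                /* configurable base block type        */
#define UINTBIG_SEGMENTS 64             /* array length fixed at compile time  */

typedef struct {
    uint_t seg[UINTBIG_SEGMENTS];       /* little-endian array of segments     */
    unsigned used;                      /* current number of used segments     */
} uintBig_t;

/* r = a + b. Only the segments actually in use are read: once the shorter
 * operand is exhausted, the loop continues over the longer one alone instead
 * of reading zero-padded segments. */
static void uintBig_add(uintBig_t *r, const uintBig_t *a, const uintBig_t *b) {
    const uintBig_t *lo = (a->used <= b->used) ? a : b;   /* shorter operand */
    const uintBig_t *hi = (a->used <= b->used) ? b : a;   /* longer operand  */
    uint64_t carry = 0;
    unsigned i = 0;
    for (; i < lo->used; i++) {
        uint64_t s = (uint64_t)hi->seg[i] + lo->seg[i] + carry;
        r->seg[i] = (uint_t)s;
        carry = s >> 32;                /* shift ties the sketch to the 32-bit base type */
    }
    for (; i < hi->used; i++) {
        uint64_t s = (uint64_t)hi->seg[i] + carry;
        r->seg[i] = (uint_t)s;
        carry = s >> 32;
    }
    r->used = hi->used;
    if (carry && r->used < UINTBIG_SEGMENTS)
        r->seg[r->used++] = (uint_t)carry;
}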
The algorithms implemented in ArPALib are summarized in Table 1. The library
implements all elementary binary operations as well as comparison and assignment
operations. In the software version it also provides input/output tools, including
conversion between binary strings of different formats and big number values.
All operations are integer-based only; therefore, the Schönhage-Strassen algorithm
and Barrett reduction are not available. On the other hand, hardware
implementations remain modest in size as a result.
[Figure: execution time t(n) [ms] versus operand size n [b], comparing ArPALib 32b, BigDigits 2.6 and GMP 6.1.1; numerical plot data omitted.]
algorithm that was created in the experiment, a well-known and widely used
primality test employed in security applications.
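For reference, the following self-contained C sketch shows the structure of the Miller-Rabin test using ordinary 64-bit integers and the GCC/Clang unsigned __int128 extension for overflow-free modular multiplication. It only illustrates the control flow that the co-processor applies to much wider ArPALib operands; the fixed witness list and the round count are illustrative choices, not those used in the paper.

#include <stdint.h>
#include <stdbool.h>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    return (uint64_t)((unsigned __int128)a * b % m);   /* avoids 64-bit overflow */
}

static uint64_t powmod(uint64_t base, uint64_t exp, uint64_t m) {
    uint64_t result = 1;
    base %= m;
    while (exp > 0) {                   /* square-and-multiply exponentiation */
        if (exp & 1) result = mulmod(result, base, m);
        base = mulmod(base, base, m);
        exp >>= 1;
    }
    return result;
}

/* One Miller-Rabin round: write n-1 = d * 2^s with d odd, then test witness a. */
static bool miller_rabin_round(uint64_t n, uint64_t a) {
    uint64_t d = n - 1;
    unsigned s = 0;
    while ((d & 1) == 0) { d >>= 1; s++; }
    uint64_t x = powmod(a, d, n);
    if (x == 1 || x == n - 1) return true;   /* n passes for this witness */
    for (unsigned r = 1; r < s; r++) {
        x = mulmod(x, x, n);
        if (x == n - 1) return true;
    }
    return false;                            /* a witnesses that n is composite */
}

/* Probabilistic primality test with a few fixed small witnesses. */
static bool is_probably_prime(uint64_t n, unsigned rounds) {
    static const uint64_t witnesses[] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29};
    if (n < 2) return false;
    if (n < 4) return true;                  /* 2 and 3 are prime */
    if ((n & 1) == 0) return false;
    for (unsigned i = 0; i < rounds && i < 10; i++) {
        uint64_t a = witnesses[i] % n;
        if (a < 2) continue;
        if (!miller_rabin_round(n, a)) return false;
    }
    return true;
}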
The implementation and experiments were performed on a Xilinx PSoC of the
Zynq-7000 family (XC7Z020), which was part of the ZedBoard platform.
The Zynq combines an ARM Cortex-A9 CPU with a relatively small programmable
logic fabric from Xilinx's 7-series FPGAs. The HLS synthesis and design
flow of the Vivado 2015.4 development tools were used. In HLS, the architecture
of the co-processor is derived from the algorithm coded in the C programming
language, accompanied by special directives that steer the hardware synthesis
(e.g. to control parallelism or select I/O interfaces). The co-processor
communication interface was built around the AXI4-Lite bus. The throughput
of AXI4-Lite is very modest, but it does not influence the performance of
[Figure: resource utilisation (LUT, FF, BRAM, DSP48) of the co-processor for the ArPALib 8b, 16b and 32b base types, together with the Miller-Rabin execution time t(n) for the 8b, 16b and 32b variants in FPGA and CPU versions (see Fig. 5 below); numerical plot data omitted.]
Fig. 5. ArPALib performance of the Miller-Rabin test in hardware and software
for n-bit numbers. The 8-, 16-, and 32-bit base types were tested, and an ARM
Cortex-A9 (667 MHz) was used for the software version.
6 Conclusions
The presented experiment demonstrates that hardware-software design symmetry
has become a reality thanks to the HLS tools available today. Software routines can
now be moved into hardware more easily to gain better performance. Although this
requires a cautious coding style, that drawback can be mitigated
by the use of hardware- and software-compatible libraries such as ArPALib.
Acknowledgements. This work was performed thanks to the funds for AGH statu-
tory activity 11.11.230.017.
References
1. Granlund, T.: GNU MP 6.0 Multiple Precision Arithmetic Library. Samurai Media
Limited, Hong Kong (2015)
2. Ireland, D.: BigDigits multiple-precision arithmetic source code. http://www.
di-mgt.com.au/bigdigits.html. Accessed 29 Sept 2016
3. Gielata, A., Russek, P., Wiatr, K.: AES hardware implementation in FPGA for
algorithm acceleration purpose. In: International Conference on Signals and Elec-
tronic Systems, pp. 137–140 (2008)
4. Kryjak, T., Gorgon, M.: Pipeline implementation of the 128-bit block cipher CLE-
FIA in FPGA. In: International Conference on Field Programmable Logic and
Applications, FPL 2009, pp. 373–378 (2009)
5. Dabrowska-Boruch, A., Gancarczyk, G., Wiatr, K.: Implementation of a RANLUX based pseudo-random number generator in FPGA using VHDL and Impulse C. Comput. Inf. 32(6), 1272–1292 (2014)
6. Jamro, E., Russek, P., Dabrowska-Boruch, A., Wielgosz, M., Wiatr, K.: The implementation of the customized, parallel architecture for a fast word-match program. Int. J. Comput. Syst. Sci. Eng. 26(4), 285–292 (2011)
7. Macheta, J., et al.: ARPALib repository. https://git.plgrid.pl/projects/ARPALIB/
repos/arpalib. Accessed 29 Sept 2016
8. Miller, G.L.: Riemann’s hypothesis and tests for primality. J. Comput. Syst. Sci.
13(3), 300–317 (1976)
9. Pommerening, K.: Cryptology. Part III. Primality Tests: RSA and Pseudoprimes
28 May 2000. Accessed 21 Feb 2016
10. Pomerance, C., Selfridge, J.L., Wagstaff, S.S.: The pseudoprimes to 25 · 10^9. Math.
Comput. 35(151), 1003–1026 (1980)
11. Walter, C.D.: Right-to-left or left-to-right exponentiation? International Workshop
on Constructive Side-Channel Analysis and Secure Design, pp. 40–46 (2010)
12. Conrad, K.: Fermat's test. http://www.math.uconn.edu/~kconrad/blurbs/ugradnumthy/fermattest.pdf
13. Solovay, R., Strassen, V.: A fast Monte-Carlo test for primality. SIAM J. Comput.
6(1), 84–85 (1977)
14. Bach, E.: Number-theoretic algorithms. Annu. Rev. Comput. Sci. 4(1), 119–172
(1990)
Author Index