banerjee2017
banerjee2017
banerjee2017
Abstract—This paper proposes a design framework for a sparseness etc) to understand the power-performance trade-
family of Parallel Prefix Adders (PPA), based on the concept of offs. Other examples of Sparse PPA’s include Conditional Sum
sparsity. The proposed design is achieved via a generalization on Adder [7] and Carry Select Adder [8]. Overall, most of the
the specific property of Group Segregation presented in Kogge-
Stone Adders (KSA). Here the computations of even and odd previous work are either limited to a narrow scope or ad-hoc
bits (forming two groups) are mutually disjoint. Furthermore, in nature. There lacks a systematic framework of constructing
the two disjoint groups can produce the computations of each Sparse PPA’s in general.
other with a reasonable hardware and performance overhead. This paper focuses on general Sparse PPA’s. A framework
This opens up the potential of (a sparse version) using only one is proposed to derive various sparse designs for any existing
group and extending its computation to cover all the bits. In
this paper, we generalize the formation of Group Segregation PPA structure. The resultant approach opens up a large set of
beyond the special case of KSA, and propose a framework to available PPA designs, with various combinations of structures
design a plethora of sparse PPA designs. These new sparse PPA’s and sparsities, such that a designer can choose among them
can be formed based on two dimensions of choices: an original along the Pareto-front. The motivation stems from the special
type of PPA and group number. As a result, the proposed design properties displayed in Kogge-Stone Adders (KSA’s), where
framework provides significant boost in the PPA design trade-off
space. the bits are divided into two groups (even and odd bits),
and the computations of each group are inherently segregated.
I. I NTRODUCTION Furthermore, the prefix computation property ensures that the
computation result of one group (even or odd bits) can be
Parallel Prefix Adders (PPA’s) provide a unifying foundation derived from the other group with a small overhead. We extend
for designing a large variety of adders. Sparse PPA’s have been this principle to apply to arbitrary PPA structures, as a general
proposed in [1], [2] to explore additional performance-power design framework for Sparse PPA’s.
trade-offs. Sparse PPA’s are constructed by allowing only a
subset of bits (denoted as “Essential Bits” in this paper) of II. P RELIMINARIES
the entire adder to compute their own results, with the rest of PPA’s [9] use three steps to compute the sum bits:
the bits (denoted as “Non-Essential Bits”) deriving their results 1. Bit-level Generate and Propagate: for each input bit
from the Essential Bits. Such a “sparse” design enables more pair Ai and Bi , the bit-level Generate and Propagate signals
varieties of PPA structures and thus facilitating design space (Gi:i , Pi:i ) are computed as:
trade-off explorations.
(Gi:i , Pi:i ) = (Ai · Bi , Ai ⊕ Bi )
In the field of general PPA design, [3], [4] propose ILP
based approaches to obtain a PPA structure satisfying a given 2. Block-level Generate and Propagate: this stage derives
set of constraints such as area, power, and performance. (Gi:0 , Pi:0 ) for all the bits i. Such results can be computed
These approaches are able to search within the huge design via the following prefix formulation of sub-blocks:
space of possible structures, to obtain a PPA that meets the
specifications. However, the runtime complexity is high, and (Gi:j , Pi:j ) = (Gi:k1 , Pi:k1 ) ◦ (Gk1 :k2 , Pk1 :k2 )◦
(Gk2 :k3 , Pk2 :k3 ) ◦ ... ◦ (Gkn :j , Pkn :j )
these approaches do not provide a ready set of PPA’s for the
designers to choose from. when i ≥ k1 ≥ k2 ≥ k3 ≥ ... ≥ kn ≥ j.
Among the previous works: [5] proposes Sparse PPA’s The “ ◦ ” operator is defined as:
by considering the radix and lateral fanout at each stage (Gi:k , Pi,k ) ◦ (Gk,j , Pk,j ) =
of the adder; [1] focuses on the the optimal sparseness of
Koge-Stone and Ladner-Fisher Adders to minimize power; [2] (Gi:k + Pi:k · Gk:j , Pi:k · Pk:j )
proposed an alternative sparse Ladner-Fischer Adder structure;
where i ≥ k ≥ j.
[6] evaluated the various parameters (structure, wire length,
3. Carry and Sum computation: the Gi:0 and Pi:0 of all
This work is supported by NSF Grant CNS-1149661
the bits are used to compute Carry Ci and Sum Si in parallel:
Ci = Gi−1:0 , Si = Pi:i ⊕ Ci
Fig. 1: Various Parallel Prefix Adder Structures in stage 2 to compute the Block GP Results
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Segregation
Group−
Sub−Adder 2: 8−bit KSA
Sub−Adder 1: 8−bit KSA
Fig. 2: a) KSA structure with the Sub-adders shown; b) 2-Sparse KSA: Removing a Sub-Adder and introducing an additional
level of GP computation to compute the final results for the other bits affected by the removal of the Sub-Adder; c)
Replacing the KSA Sub-Adder with Ladner-Fischer (LFA) Sub-Adder to form 2-Sparse LFA
Obviously, the most critical part of a PPA is stage 2 in KSA is known to employ a large amount of hardware (with
terms of area, power and performance. In this crucial stage, the redundancy) to maximize parallelism. This enables the fastest
Block-level Generate and Propagate (Gi:0 , Pi:0 ) is computed computation for block GP results for all the bits.
for every bit i, and each of these computations relies on In fact, two properties make KSA a distinct candidate
the preceding bits from (i − 1) to 0. Therefore the entire for a sparse design. First, the massive hardware in a KSA
circuit consists of a large network of Generate and Propagate is organized in a highly regular fashion - the computations
blocks. Since “ ◦ ” operator is associative, the block-level of even bits and odd bits are entirely segregated and thus
Generate and Propagate computations can be done in arbitrary are carried out independently. Second, due to the prefix-
order. Various PPAs thus use distinct architectures for this computation nature of block GP generation, the result of each
stage. Figure 1 shows this stage 2 structures of various PPA’s, bit can be derived (with a small overhead) from its immediate
including Kogge-Stone (KSA), Han-Carlson (HCA), Ladner- lower neighboring bit. This means that the two groups of even
Fischer (LFA), Ripple-Carry (RCA) and an Arbitrary PPA, and odd bits can be mutually substitutive to each other. From
with each of the circles representing one “ ◦ ” operation, or a these two properties, it is easy to derive a 2-Sparse KSA
GP Block. design, by keeping only one group (the odd bits, for instance)
III. S PARSITY OF A DDERS as Essential Bits and use its results to derive results of the
Non-Essential Bits (the even bits, in this case). The resulting
In general, a k-Sparse adder computes the Block-level GP
2-Sparse KSA is in fact isomorphic to the HCA shown in
results (stage 2 of a PPA) only from every k th bit (denoted as
Figure 1(b).
Essential Bits thereafter), and use their results to derive results
(of stage 2) for the remaining bits (denoted as Non-Essential KSA’s inherent hardware redundancy and regularity is not
Bits). generally presented in other PPA’s. To make a general 2-Sparse
version based on an arbitrary PPA structure, we make the
A. From 2-Sparse KSA to 2-Sparse General PPA following observations by re-organizing the bits of a KSA as
The motivation of a general sparse PPA comes from the is shown in Figure 2(a). Here it becomes clear that, the two
inherent Group Segregation property of KSA’s [10]. A 16-bit groups (odd and even bits, shown in different patterns) each
KSA is shown in Figure 1(a). Comparing to the other designs, by itself forms a structure that is isomorphic to a KSA, albeit
232
half in size (8-bit). In other words, the segregation of groups e e−1 e−2 e−3 e−4 e−5 e−6 e−7 e e−1 e−2 e−3 e−4 e−5 e−6 e−7
in a KSA not only presents the properties of independence and
mutual substitution, but also forms two sub-KSA structures.
A further investigation reveals that these two 8-bit KSA’s
(denoted as Sub-Adders) become completely disjoint after
the first layer of GP-blocks from the input side. Essentially,
the first level of GP operations gathered enough information
for each Sub-Adder, such that each can carry out its own
(a) in log 8 = 3 levels
computations independently.
e e−1 e−2 e−3 e−4 e−5 e−6 e−7
Figure 2(b) then shows that the 2-Sparse version of a KSA
is achieved in three steps: 1) segregating even and odd bits by
the first layer of GP operators; 2) eliminating the computations
of one group (the even bits) as Non-Essential Bits; 3) derive
the final results of the Non-Essential Bits by adding a layer
(c) in 1 level with (b) in 8−1 = 7 levels
of GP-blocks at the output end, from the results of Essential
8−input GP Block with RCA−like structure
Bits.
From the above observations it is then possible to generalize Fig. 3: A variety of Group-Segregation Block structures
a 2-Sparse KSA with other types of Sub-Adders, as is shown
in the example of Figure 2(c). Note that as long as the first
e−1 e−2 e−3 e−4 e−5 e−6 e−7 e−8
layer of GP-blocks (denoted as Group-Segregation) is present,
two independent and mutually substitutable groups (even and
odd bits) are formed. From then on, one group (the odd bits) e−1 e−2 e−3 e−4 e−5 e−6 e−7
can be designated as the Essential Bits, and it can use any form
G(n:0), P(n:0)
of PPA structure as a Sub-Adder. The associated GP operators
(b) after RCA−like Group
for the other group will be eliminated as these even bits now e−8
−Segregation
constitute the Non-Essential Bits. Eventually their results will e−1 e−2 e−3 e−4 e−5 e−6 e−7 e−8
be derived from the Essential Bits via the last layer of GP-
blocks at the output side (denoted as Residual GP Restoration).
Overall, the Group-Segregation in KSA is responsible for
the creation of two mutually exclusive groups. On the other
hand, the Residual GP Restoration for the Non-Essential Bits G(n:0), P(n:0)
is responsible for computing the block GP results from the
Essential Bits. These two building blocks make it possible to (a) after Group−Segregation
in log 8 = 3 levels G(n:0), P(n:0)
design general Sparse PPA’s.
(c) after single level
B. Extension from 2-Sparse PPA to k-Sparse PPA Group−Segregation
For a 2-Sparse design, the Group-Segregation and Residual Fig. 4: Various Residual GP Restoration options
GP Restoration are both straightforward in implementation,
each consisting of one layer of GP blocks. However, extending level (using 2-input GP computation), to k − 1 level (using
from 2 to a k-Sparse design will significantly increase the Ripple-Carry-Adder fashioned sequential computation). Such
overhead in these two building blocks. In fact, this is a choices are shown in Figure 3(a) to (c) for an example of 8-
necessary price to pay, when more than half of the bits are Sparse design. Note that the 1-level design shown in Figure
made Non-Essential. Moreover, a variety of design options can 3(c) essentially employs a high radix [11] GP Block with 8
be explored to realize the functionalities of Group-Segregation inputs, which involves a large overhead internally within each
and Residual GP Restoration. GP operator. In the experimental result section, we choose
1) Group-Segregation designs for k-Sparse PPA’s: The the form in Figure 3(a), where the Group-Segregation always
Group-Segregation stage functions to “bootstrap” the Essential requires log2 k levels.
Bits, such that when they are sent through the Sub-Adder, no 2) Residual GP Restoration designs for k-Sparse PPA’s:
interactions are needed with the Non-Essential Bits. For 2- For each of the Non-Essential Bit n, to achieve the final block
Sparse design, this means every Essential Bit (every 2nd bit) e GP signal of (Gn:0 , Pn:0 ) without a Sub-Adder, two tasks are
needs to compute (Ge:e−1 , Pe:e−1 ) at the Group-Segregation required, both related to the nearest Essential Bit e to the right
stage. For a k-Sparse adder, every k th bit e now needs to of n (assuming n > e). First, the Non-Essential Bit n needs
compute (Ge:e−k+1 , Pe:e−k+1 ), which is identical to a k-bit to compute its local block GP signal of (Gn:e+1 , Pn:e+1 ), and
Block-level GP computation.. this can be done without the results from the Essential Bits.
To form a k-bit block GP computation, design choices range Then, once the results of the essential bit e, (Ge:0 , Pe:0 ) from
from 1-level (using a k-input GP computation), log2 k - the Sub-Adder is available, bit n will integrate the result of
233
Essential Bits Computation Non−Essential Bits Computation m−bit k−Sparse PPA configuration:
Group−Segregation Sub−Adder Residual GP−Restoration
m bits divided into k groups
G(i:i), P(i:i)
k k k k
...
m
...
G(i:0), P(i:0) for all m bits
Fig. 6: Overall design structure of an m-bit k-Sparse PPA
234
X Y
7 6 5 4 3 2 1 0 (x1, y1)
(x4, y4)
0
(x2, y2)
1
(x3, y3)
Logical Interconnects
2
(x1, y1)
(x4, y4)
3 (x2, y2)
(x3, y3)
4 Physical Interconnects:
Half−perimeter of the
smallest rectangle
5 that encompasses all
the GP Blocks
235
Figure 9 shows the design space for the plethora of sparse
PPA’s, including KSA, HCA, LFA, BKA and RCA in power-
! delay domain. While KSA and RCA 1-Sparse and 2-Sparse
#
!
!
%
designs are having a large variation between them in terms of
$
power-delay, in most of the cases, the changes can be observed
%
! to be rather small. Such small changes have been shown in the
!
$
zoomed-in plot of Figure 10. Thus sparsity exploration helps in
%
! “fine-tuning” the adder parameters as is shown in this region.
"$
$
%
VI. C ONCLUSION
! !"!
We have proposed a general design framework for con-
$% structing Sparse PPA’s of various sparsities and forms. This
approach is built on the insight derived from the inherent
" $
property of KSA’s and is generalized on two fronts of the
sparsity and the Sub-Adder structure. The overall scheme
Fig. 9: The 64-bit Adder Design space in power-delay works by: 1) enabling independent computation via a Group-
domain Segregation stage; 2) treating Essential Bits and Non-Essential
Bits differently by only using one copy of the Sub-Adder to
compute for Essential Bits; 3) using the results from the Es-
sential Bits to derive the final results of the Non-Essential Bits.
The proposed framework includes other design flexibilities
regarding how to implement the Group-Segregation and the
Residual GP Restoration building blocks. Simulation results
verify that the proposed design framework indeed opens up a
large design space with various choices along the pareto front,
in terms of performance, area, and power.
R EFERENCES
[1] M. Aktan, D. Baran, and V. Oklobdzija, “Minimizing Energy by Achiev-
ing Optimal Sparseness in Parallel Adders,” in Computer Arithmetic
(ARITH), IEEE 22nd Symposium on, 2015, pp. 10–17.
[2] S. Matthew, M. Anders, R. Krishnamurthy, and S. Borkar, “A 4-GHz
130-nm Address Generation Unit with 32-bit Sparse-Tree Adder Core,”
IEEE Journal of Solid-State Circuits, vol. 38, no. 5, pp. 689–695, May
2003.
Fig. 10: A zoomed-in view of Figure 9. [3] J. Liu, Y. Zhu, H. Zhu, C. Cheng, and J. Lillis, “Optimum prefix adders
All other metrics show improvement with a more sparse in a comprehensive area, timing and power design space,” in Asia and
South Pacific Design Automation Conference, 2007, pp. 609–615.
design. For HCA, a design of sparsity 3 provides an optimal [4] S. Roy, M. Choudhury, R. Puri, and D. Pan, “Towards optimal
case in power, while sparsity of 2 provides an optimal point for performance-area trade-off in adders by synthesis of parallel prefix
the metric of Power-Delay Product (PDP) of all the choices. structures,” IEEE Transactions on Computer-Aided Design of Intergrated
Circuits and Systems, vol. 33, no. 10, pp. 1517–1530, 2014.
For LFA’s, which have a high fan-out for GP Blocks, in- [5] R. Zlatanovici, S. Kao, and B. Nikolic, “Energy–Delay Optimization of
crease in sparsity does not affect significantly area or intercon- 64-Bit Carry-Lookahead Adders With a 240 ps 90 nm CMOS Design
nect. However, delay decreases consistently with increasing Example,” IEEE Journal of Solid-State Circuits, vol. 44, Issue 2, pp.
569–583, Feb 2009.
sparseness due to the fan-outs decreasing. For power, we can [6] B. Zeydel, D. Baran, and V. Oklobdzija, “Energy-Efficient Design
observe an optimum point at sparsity of 3. PDP also has a Methodologies: High-Performance VLSI Adders,” IEEE Journal of
downward trend, but the rate of reduction goes down as the Solid-State Circuits, vol. 45, no. 6, pp. 1220–1233, June 2010.
[7] J. Sklansky, “Conditional-Sum Addition Logic,” IRE Transactions on
sparsity increases. Electronic Computers, vol. EC-9, no. 2, pp. 226–231, June 1960.
For BKA’s, the power and PDP is lower at the typical 1- [8] O. Bedrij, “Carry-Select Adder,” IRE Transactions on Electronic Com-
Sparse over other higher sparsity designs. Nevertheless, we can puters, vol. EC-11, no. 0, pp. 340–346, June 1962.
[9] B. Parhami, Computer Arithmetic Algorithms and Hardware Designs,
observe the presence of the fastest possible BKA at sparsity 2nd ed. Oxford University Press, New York, 2010.
6. [10] S. Banerjee and W. Rao, “A general approach for highly defect tolerant
parallel prefix adder design,” in Design, Automation and Test in Europe
Finally, for RCA, while area, power, number of GP Blocks Conference and Exhibition, 2016, pp. 666–671.
and interconnect length increases, there is a big performance [11] F. Gurkayna, Y. Leblebicit, L. Chaouati, and P. McGuiness, “Higher
improvement with increased sparsity. For instance, 2-Sparse radix kogge-stone parallel prefix adder architectures,” IEEE Interna-
tional Symposium on Circuits and Systems, vol. 5, pp. 609–612, 2000.
RCA is almost twice as fast as a default 1-Sparse RCA.
Furthermore, PDP also has a decreasing trend with increasing
sparsity, indicating the presence of an exploitable power-delay
trade-off.
236