A Novel Hybrid Parallel-Prefix Adder Architecture With
Efficient Timing-Area Characteristic
Sabyasachi Das and Sunil P. Khatri
• If the block B_{j,k+1} generates a carry and the block B_{i,j+1} propagates that carry. In other words, if (g_{j,k+1} = 1) and (p_{i,j+1} = 1).
• If the block B_{k,l} generates a carry and the blocks B_{i,j+1} and B_{j,k+1} propagate that carry. In other words, if (g_{k,l} = 1), (p_{i,j+1} = 1), and (p_{j,k+1} = 1).

The equations for computing the generate and propagate values of the combined block are as follows:

g_{i,l} = g_{i,j+1} ∨ (p_{i,j+1} ∧ g_{j,k+1}) ∨ (p_{i,j+1} ∧ p_{j,k+1} ∧ g_{k,l})    (6)
p_{i,l} = p_{i,j+1} ∧ p_{j,k+1} ∧ p_{k,l}.    (7)

We denote the expressions in (6) and (7) as the "o3" operator (or the triple-carry operator), which takes three pairs of generate and propagate values as inputs and produces the generate and propagate values of the combined large block. The block diagram of the "o3" operator (triple-carry operator) is shown in Fig. 1(b).

We use this in conjunction with the classic approach of computing generate and propagate signals for a merged block by combining two adjacent blocks [as explained in (4) and (5)]. This is called the "o" operator. The block diagram of the "o" operator is shown in Fig. 1(a).

The key idea in our approach is to find opportunities to use the triple-carry operator ("o3" operator). By analyzing several technology libraries provided by commercial vendors, we have found that the worst delay through a triple-carry operator is between 110% and 130% of that of the traditional carry operator. Since the "o3" operator produces the generate and propagate value pair by combining three blocks, as opposed to two blocks in the traditional carry operator, the additional 10% to 30% of delay is well justified. Since the "o3" operator processes one additional block compared to the "o" operator, it reduces the depth of the parallel-prefix network. The area of the "o3" operator is 50% to 80% more than the area of the "o" operator; hence, we only use the "o3" operator in the timing-critical portion of the parallel prefix tree. This delay characteristic makes the triple-carry operator an efficient choice in the parallel prefix network.

Fig. 2. Our proposed parallel prefix network (for input width of 16 bits).
Fig. 3. Our proposed parallel prefix network (for input width of 24 bits).

The block diagram of our proposed 16-bit-wide parallel prefix graph (of an adder block) is shown in Fig. 2. In addition, Fig. 3 represents the block diagram of our proposed 24-bit-wide parallel prefix computation network. Since carry_{i+1} is equal to g_{i,0}, we label all the outputs in Figs. 2 and 3 as C_i (for each value of i). Due to the lack of space in the diagrams, C_i is used as an abbreviation for carry_i. Both diagrams are drawn in a levelized fashion. Let us assume that the inputs to the parallel prefix tree (GP_i) are at level (or depth) 0, shown at the top of Figs. 2 and 3. As we proceed downwards in Figs. 2 and 3, the level (or depth) increases by one. In these designs, at level 1, we initially use a large number of "o3" operators. We instantiate triple-carry operators to combine every GP_{3p}, GP_{3p+1}, and GP_{3p+2} (until each GP_i participates in an operator). Then, in the second level, we mostly perform the traditional carry operation with the "o" operator (with some "o3" operators as well). In levels lower than 1, we perform timing-driven optimization and use a combination of the two types of operators. To avoid problems due to high-fanout nets, we restrict the maximum fanout of any net to 5. To maintain this strict limit on fanouts, we use the triple-carry operators quite aggressively in the bits near the most significant bit. We note that, in most parallel prefix computation tree designs, the critical paths primarily go through the outputs placed near the most significant bit (n − 1). Hence, we try to instantiate triple-carry operators in the paths which go through the critical pins. This reduces the depth along those paths (at the expense of additional hardware) and improves the performance of the parallel prefix block. On the other hand, we also note that bits near the least significant bit typically have positive slack. To exploit this fact and to perform area reduction, we use a ripple-type structure in that part of the design (without impacting the overall performance of the block). As a result, we claim that our design is timing driven in the timing-critical paths and area driven in the non-timing-critical (and area-critical) paths.

Note that we do not extend the concept of "o3" operators to combine four adjacent blocks into a single block (an "o4" block). This is because an "o4" operator is a combination of two levels of "o" operators (three "o" operators in total) arranged in a tree-like fashion. Hence, unlike the usage of the "o3" operator, usage of an "o4" operator is not an architectural optimization. Depending on the availability of cells in the technology library, a high-quality technology mapping algorithm in commercial logic synthesis tools should be able to efficiently use the cells required for the "o4" operator.

IV. EXPERIMENTAL RESULTS

To collect different data points regarding the quality of results for the adder blocks, we used the following variations.
• Adder blocks of different input widths:
  — We have used adders having different input widths. In Table I, we show the final results for adders having input bit-widths (n) equal to 16, 24, 32, 48, and 64 bits. We refer to these blocks as Adder-16, Adder-24, Adder-32, Adder-48, and Adder-64, respectively.
• Different technologies and libraries:
  — two commercial libraries (L1 and L2) for 0.13-μm technology;
  — two commercial libraries (L3 and L4) for 0.09-μm technology.
• Different input arrival time constraints. We used the following input arrival time constraints.
  — Different input bits of signals a and b arrive at different times. The motivation for this is as follows. There exists an adder sub-block inside every arithmetic sum-of-product (SOP) and multiplier block. Due to the wide usage of SOPs and multipliers in modern digital designs, the performance of this adder block is crucial in determining the performance of the design. Thus, we model this timing constraint [15]. Since an adder is an internal part of a SOP and multiplier block, the arrival times of different inputs of the adder block are not identical.
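As an illustrative sketch (ours, not the authors' implementation), the "o" operator of (4) and (5) and the "o3" operator of (6) and (7) can be modeled in software on (generate, propagate) pairs, with the first operand covering the more significant block, and checked against ordinary addition:

```python
# Software model of the "o" (carry) and "o3" (triple-carry) operators
# on (generate, propagate) pairs. Illustrative only; operand order is
# most-significant block first, matching the text's B_{i,j+1} notation.

def op_o(x, y):
    """Classic 'o' operator: combine two adjacent (g, p) blocks."""
    gx, px = x
    gy, py = y
    return (gx | (px & gy), px & py)

def op_o3(x, y, z):
    """Triple-carry 'o3' operator, per (6) and (7): combine three
    adjacent (g, p) blocks in a single step."""
    gx, px = x
    gy, py = y
    gz, pz = z
    g = gx | (px & gy) | (px & py & gz)
    p = px & py & pz
    return (g, p)

def carries(a, b, n):
    """All carries of an n-bit addition via prefix combination of the
    bitwise (g, p) pairs; carry_{i+1} = g_{i,0} as in the text."""
    gp = [((a >> i) & (b >> i) & 1, ((a >> i) ^ (b >> i)) & 1)
          for i in range(n)]
    acc = gp[0]                 # combined block (0, 0)
    out = [acc[0]]              # carry_1 = g_{0,0}
    for i in range(1, n):
        acc = op_o(gp[i], acc)  # extend the combined block to (i, 0)
        out.append(acc[0])      # carry_{i+1} = g_{i,0}
    return out

# One 'o3' step is logically equivalent to two chained 'o' steps on
# the same three blocks -- it saves a level of depth, not a function:
x, y, z = (0, 1), (1, 0), (0, 1)
assert op_o3(x, y, z) == op_o(x, op_o(y, z))
```

This also makes the paper's point concrete: `op_o3` computes in one operator level what takes two levels of `op_o`, which is why placing it on critical paths reduces prefix-tree depth.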
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 3, MARCH 2008 329
TABLE I
DELAY AND AREA COMPARISON OF ADDER BLOCKS GENERATED BY BK, KS, AND OUR APPROACH
Hence, we cannot directly write timing constraints to control the arrival times for the inputs of the adder. As a result, we specified the arrival time constraints for the inputs of the SOP and the multiplier. Once the input arrival times are specified for SOPs and multipliers, the synthesis tool propagates the arrival times through each sub-block inside the SOP and multiplier. We then report the actual arrival-time numbers at the input of the adder sub-block inside the SOP and multiplier. In this manner, we collected a significant amount of data on the arrival times of the adder inputs. From this arrival-time data, we derived the following equations. We believe that these equations closely represent the actual arrival timing constraint for the adder sub-blocks inside real-life SOP and multiplier blocks. We refer to this category of timing constraints as Arr(late). Let us denote Arr(a_i) as the arrival time of the signal a_i. Assuming that k is a constant and Δ is the delay of the fastest two-input AND gate in the technology library, the following is the Arr(late) timing constraint (n is the width of the adder inputs):

Arr(a_i) = i·k·Δ,                           0 ≤ i ≤ ⌈3n/5⌉
Arr(a_i) = ⌈3n/5⌉·k·Δ − (i − ⌈3n/5⌉)·k·Δ,   ⌈3n/5⌉ < i < n
Arr(b_i) = i·k·Δ,                           0 ≤ i ≤ ⌈3n/5⌉
Arr(b_i) = ⌈3n/5⌉·k·Δ − (i − ⌈3n/5⌉)·k·Δ,   ⌈3n/5⌉ < i < n.

  — All input bits of the signals a and b arrive at the same time. We refer to this constraint as Arr(same). If k is a constant number, then the Arr(same) constraint can be represented as

Arr(a_i) = k·Δ,    0 ≤ i < n
Arr(b_i) = k·Δ,    0 ≤ i < n.

We have implemented the BK adder [2], the KS adder [3], and our proposed adder for different operand widths. We optimized each of the architectures by using a best-in-class commercially available datapath synthesis tool (run on a workstation with dual 2.2-GHz processors, 4 GB of memory, and RedHat 7.1 Linux). The synthesis tool performed operations such as technology-independent optimization, constant propagation, redundancy removal, technology mapping, timing-driven optimization, area-driven optimization, incremental optimization, etc. Due to licensing agreements, we are unable to mention the name of the commercial tool we used. In Table I, we present the post-synthesis worst-case delay and total area results for the adder block for each of the three architectures (as reported by the synthesis tool). To compute the worst-case delay, the static timing computation engine inside the datapath synthesis tool was used. To compute the total area, the technology library cell information was used.

In Table I, we report 25 sets of data points for adders of different widths, timing constraints, and technology libraries. On average, our proposed approach results in a 23.96% faster adder (column 7 of Table I), with a 9.39% area penalty (column 12). When compared with the KS adder, our proposed approach results in a marginally (0.77%) faster implementation (column 8), with a significant (29.71%) area improvement (column 13). Note that, like the BK and KS approaches, our approach generates the same structure irrespective of the input arrival timing constraints. Then, depending on the arrival timing constraint, the technology mapping algorithms will choose different technology cells to yield different final worst delay (and area) numbers.

To verify the correlation of the post-synthesis experimental data with the post place-and-route data, we performed placement and routing on one Adder-32 and one Adder-64 design. For these two testcases, the average post-routing worst delays of the BK adder, the KS adder, and our proposed adder are (normalized to the worst delay of the BK adder): 1.0, 0.78, and 0.76, respectively. Similarly, the post-routing total areas of the BK adder, the KS adder, and our proposed adder are (normalized to the area of the BK adder): 1.0, 1.34, and 1.07, respectively.
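For concreteness, the Arr(late) and Arr(same) arrival-time profiles defined above can be sketched in a few lines (our illustration; k and Δ stand for the constant and the fastest two-input AND-gate delay from the text, here defaulted to 1.0):

```python
import math

def arr_late(n, k=1.0, delta=1.0):
    """Arr(late) profile: arrival times rise linearly up to bit
    ceil(3n/5), then fall linearly toward the MSB (sketch of the
    text's Arr(late) equations)."""
    peak = math.ceil(3 * n / 5)
    times = []
    for i in range(n):
        if i <= peak:
            times.append(i * k * delta)
        else:
            times.append(peak * k * delta - (i - peak) * k * delta)
    return times

def arr_same(n, k=1.0, delta=1.0):
    """Arr(same) profile: every input bit arrives at time k*delta."""
    return [k * delta] * n

# Example: for n = 16 the profile peaks at bit ceil(48/5) = 10,
# mimicking the skewed arrival times inside an SOP or multiplier.
profile = arr_late(16)
assert max(profile) == profile[10]
```

The late-arriving middle bits under Arr(late) are what make the timing-driven placement of "o3" operators near the critical outputs worthwhile.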
TABLE II
LEAKAGE AND DYNAMIC POWER COMPARISON OF ADDER BLOCKS GENERATED BY BK, KS, AND OUR APPROACH
The individual results for the Adder-32 and Adder-64 designs correlate closely with the post-synthesis numbers reported in Table I. These results after place and route confirm our conclusion about the efficient timing-area characteristic of our approach.

For reference purposes, we implemented the ripple adder and measured its delay and area numbers across all our adder designs, libraries, and timing constraints. The experimental data showed that, on average, our proposed adder is about 62% faster and 239% larger than the ripple adder.

We also performed some additional experimentation by using different values of Δ in the equation for Arr(late). The modified values of Δ we tried are equal to: 1) a two-input XOR gate delay from the technology library; 2) a two-input OR gate delay; 3) an inverter gate delay; and 4) 1 (a constant number). In each of these cases, the resulting delay and area numbers of our adders exhibit substantially the same timing-area characteristics as reported in Table I.

In Table II, we present the post-synthesis leakage and dynamic power results for the adder block for each of the three architectures (as reported by the synthesis tool). By analyzing the results, we note that the power consumption of our adder is more than that of the BK adder, but significantly less than that of the KS adder.

State-of-the-art designs need to be designed while considering different cost metrics (timing, area, power, etc.). The efficient timing-area characteristic of our proposed adder design (over the well-known BK and KS adders) is consistent across multiple adder sizes, timing constraints, and technology libraries. This underscores the utility and scalability of our design. Since addition is a frequently used part of many critical operations in an IC, we believe that many real-life designs can significantly benefit from our proposed architecture.

To achieve the proposed architecture's benefit, the designer does not need to perform any extra task. Typically, a state-of-the-art datapath synthesis tool has multiple architectures available for an adder, and it selects the appropriate one depending on the timing constraint, library, the design, etc. As a result, when this architecture is the best for the given situation for the given adder block, the synthesis tool will automatically select it without the designer doing anything special.

V. CONCLUSION

In this paper, we have presented a hybrid approach for implementing an adder block based on the fast parallel prefix architecture. The proposed adder exhibits very efficient timing-area tradeoff characteristics. Our hybrid architecture is based on the triple-carry operator ("o3") and the classical carry operator ("o"). It works seamlessly with adder blocks of different widths and across different technology domains (0.13 μm, 0.09 μm, etc.). The experimental results indicate that our proposed adder is significantly faster than the popular BK adder with some area overhead. On the other hand, the proposed adder also shows marginally faster performance than the fast KS adder with significant area savings.

REFERENCES
[1] J. Sklansky, "Conditional sum addition logic," IRE Trans. Electron. Comput., vol. EC-9, no. 6, pp. 226–231, 1960.
[2] R. P. Brent and H. T. Kung, "A regular layout for parallel adders," IEEE Trans. Comput., vol. 31, no. 3, pp. 260–264, Mar. 1982.
[3] P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations," IEEE Trans. Comput., vol. C-22, no. 8, pp. 783–791, Aug. 1973.
[4] R. E. Ladner and M. J. Fischer, "Parallel prefix computation," J. ACM, vol. 27, no. 4, pp. 831–838, 1980.
[5] T. Han and D. A. Carlson, "Fast area-efficient VLSI adders," in Proc. 8th Symp. Comput. Arithmetic, 1987, pp. 49–56.
[6] H. Zhu, C. K. Cheng, and R. Graham, "On the construction of zero-deficiency parallel prefix circuits with minimum depth," ACM Trans. Des. Autom. Electron. Syst., vol. 11, no. 2, pp. 387–409, 2006.
[7] Y. C. Lin and C. C. Shih, "A new class of depth-size optimal parallel prefix circuits," J. Supercomput., vol. 14, no. 1, pp. 39–52, 1999.
[8] Y. C. Lin and C. Y. Su, "Faster optimal parallel prefix circuits: New algorithmic construction," J. Parallel Distrib. Comput., vol. 65, no. 12, pp. 1585–1595, 2005.
[9] T. Matsunaga and Y. Matsunaga, "Area minimization algorithm for parallel prefix adders under bitwise delay constraints," in Proc. 17th Great Lakes Symp. VLSI, 2007, pp. 435–440.
[10] Y. Choi and E. E. Swartzlander, Jr., "Parallel prefix adder design with matrix representation," in Proc. 17th IEEE Symp. Comput. Arithmetic (ARITH), 2005, pp. 90–98.
[11] G. Dimitrakopoulos and D. Nikolos, "High-speed parallel-prefix VLSI Ling adders," IEEE Trans. Comput., vol. 54, no. 2, pp. 225–231, Feb. 2005.
[12] V. Dave, E. Oruklu, and J. Saniie, "Performance evaluation of flagged prefix adders for constant addition," in Proc. IEEE Int. Conf. Electro/Inf. Technol., 2006, pp. 415–420.
[13] J. Liu, S. Zhou, H. Zhu, and C. K. Cheng, "An algorithmic approach for generic parallel adders," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des., 2003, pp. 734–740.
[14] R. Lin, K. Nakano, S. Olariu, and A. Y. Zomaya, "An efficient parallel prefix sums architecture with domino logic," IEEE Trans. Parallel Distrib. Syst., vol. 14, no. 9, pp. 922–931, Sep. 2003.
[15] P. F. Stelling and V. G. Oklobdzija, "Design strategies for optimal hybrid final adders in a parallel multiplier," J. VLSI Signal Process., vol. 14, no. 3, pp. 321–331, 1996.

Low Power Design of Precomputation-Based
Content-Addressable Memory
Shanq-Jang Ruan, Chi-Yu Wu, and Jui-Yuan Hsieh

Abstract—Content-addressable memory (CAM) is frequently used in applications, such as lookup tables, databases, associative computing, and networking, that require high-speed searches, due to its ability to improve application performance by using parallel comparison to reduce search time. Although the use of parallel comparison results in reduced search time, it also significantly increases power consumption. In this paper, we propose a Block-XOR approach to improve the efficiency of low-power precomputation-based CAM (PB-CAM). Through mathematical analysis, we found that our approach can effectively reduce the number of comparison operations by 50% on average as compared with the ones-count approach for 32-bit-long inputs. In our experiment, we used Synopsys Nanosim to estimate the power consumption in TSMC 0.35-μm CMOS technology. Compared with the ones-count PB-CAM system, the experimental results show that our proposed approach can achieve on average 30% in power reduction and 32% in power-performance reduction. The major contribution of this paper is that it presents theoretical and practical proofs to verify that our proposed Block-XOR PB-CAM system can achieve greater power reduction without the need for a special CAM cell design. This implies that our approach is more flexible and adaptive for general designs.

Index Terms—Content-addressable memory (CAM), low-power, precomputation.

Manuscript received March 27, 2006; revised March 12, 2007, April 9, 2007, and June 14, 2007.
The authors are with the Department of Electronic Engineering, National Taiwan University of Science and Technology, Taipei 106, Taiwan, R.O.C. (e-mail: sjruan@mail.ntust.edu.tw).
Digital Object Identifier 10.1109/TVLSI.2007.915514

I. INTRODUCTION

A content-addressable memory (CAM) is a critical device for applications involving asynchronous transfer mode (ATM), communication networks, LAN bridges/switches, databases, lookup tables, and tag directories, due to its high-speed data search capability. A CAM is a functional memory with a large amount of stored data that simultaneously compares the input search data with the stored data. Once matching data are found, their addresses are returned as output, as shown in Fig. 1.

Fig. 1. Conventional CAM architecture.

The vast number of comparison operations required by CAMs consumes a large amount of power. In the past decade, much research on energy reduction has focused on the circuit and technology domains ([1] provides a comprehensive survey of CAM designs from the circuit to the architectural level). Several works on reducing CAM power consumption have focused on reducing match-line power [2]–[4]. Although there has been progress in this area in recent years, the power consumption of CAMs is still high compared with RAMs of similar size. At the same time, research in associative cache system design for power efficiency at the architectural level continues to increase. The filter cache [5], [6] and location cache techniques [7] can effectively reduce power dissipation by adding a very small cache. However, the use of these caches requires major modifications to the memory structure and hierarchy to fit the design. Pagiamtzis et al. proposed a cache-CAM (C-CAM) that reduces power consumption relative to the cache hit rate [8]. Lin et al. presented a ones-count precomputation-based CAM (PB-CAM) that achieves low-power, low-cost, low-voltage, and high-reliability features [9]. Although Cheng [10] further improved the efficiency of PB-CAMs, the proposed approach requires considerable modification of the memory architecture to achieve high performance. Therefore, it is beyond the scope of general CAM design. Moreover, the disadvantage of the ones-count PB-CAM system [9] is that it adopts a special memory cell design for reducing power consumption, which is only applicable to the ones-count parameter extractor.

In this paper, we present a Block-XOR approach for reducing comparison operations in the second part of the PB-CAM. Our approach employs a brand-new parameter extractor, which reduces the required comparison operations better than the ones-count approach [9]. Our approach reduces power consumption by reducing comparison operations.

The remainder of this paper is organized as follows. In Section II, we briefly describe the PB-CAM architecture. Our new architecture is described in Section III, where the design of our Block-XOR parameter extractor is provided and we exploit mathematical analysis to prove the effectiveness of our proposed architecture. In Section IV, experimental results are provided to further verify our mathematical analysis. In addition, we give a comprehensive comparison between [9] and our approach. Finally, we conclude in Section V.

II. PREVIOUS WORK AND OBSERVATION

To understand our approach more clearly, we need to briefly describe the architecture of the PB-CAM proposed in [9].
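As background, the precomputation-based filtering idea can be sketched in software (our illustration of the concept only, not the circuit design of [9]): a short precomputed parameter (here the ones count) is stored alongside each word, a search compares parameters first, and only entries whose parameter matches undergo the full word comparison, which is where the comparison-power savings come from.

```python
# Software model of PB-CAM two-stage search; a conceptual sketch,
# not the memory-cell design from [9].

def ones_count(word, width=8):
    """Parameter extractor of the ones-count PB-CAM: number of 1 bits."""
    return bin(word & ((1 << width) - 1)).count("1")

class PBCAM:
    def __init__(self, words, extractor=ones_count):
        self.extractor = extractor
        self.entries = [(extractor(w), w) for w in words]

    def search(self, key):
        """Two-stage search: cheap parameter comparison first; the full
        word comparison runs only on parameter matches. Returns the
        matching addresses and the number of full comparisons done."""
        param = self.extractor(key)
        full_compares = 0
        matches = []
        for addr, (p, w) in enumerate(self.entries):
            if p == param:              # first stage: parameter filter
                full_compares += 1
                if w == key:            # second stage: full comparison
                    matches.append(addr)
        return matches, full_compares

cam = PBCAM([0b00001111, 0b10100000, 0b11110000, 0b00000001])
addrs, n_full = cam.search(0b11110000)
assert addrs == [2] and n_full == 2   # only the two four-ones words fully compared
```

A Block-XOR extractor would simply replace `ones_count` with a different (hypothetical here) parameter function; the paper's claim is that a better-distributed parameter filters out more entries before the costly second stage.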