das2008

Download as pdf or txt
Download as pdf or txt
You are on page 1of 6

326 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO.

3, MARCH 2008

[8] G. Dimitrakopoulos and D. Nikolos, “High-speed parallel-prefix VLSI


Ling adders,” IEEE Trans. Comput., vol. 54, no. 2, pp. 225–231, Feb.
2005.
[9] Y. Choi and E. E. Swartzlander, Jr., “Design of a hybrid prefix adder for
non-uniform input arrival times,” in Proc. SPIE Adv. Signal Process.
Algorithms, Arch., Implementations XII, 2002, pp. 456–465.
[10] H. Ling, “High-speed binary adder,” IBM J. R&D, vol. 25, pp. 156–166,
1981.
[11] T. Lynch and E. E. Swartzlander, Jr., “A spanning tree carry lookahead
adder,” IEEE Trans. Comput., vol. 41, no. 8, pp. 931–939, Aug. 1992.
[12] J. Grad and J. E. Stine, “A hybrid Ling carry-select adder,” in Proc.
38th Asilomar Conf. Signals Syst. Comput., 2004, pp. 1363–1367.

Fig. 5. Sum logic for late increment with critical INC signal.
A Novel Hybrid Parallel-Prefix Adder Architecture With
Efficient Timing-Area Characteristic
Sabyasachi Das and Sunil P. Khatri

Abstract—Two-operand binary addition is the most widely used arith-


metic operation in modern datapath designs. To improve the efficiency of
this operation, it is desirable to use an adder with good performance and
area tradeoff characteristics. This paper presents an efficient carry-looka-
head adder architecture based on the parallel-prefix computation graph.
In our proposed method, we define the notion of triple-carry-operator,
which computes the generate and propagate signals for a merged block
which combines three adjacent blocks. We use this in conjunction with
the classic approach of the carry-operator to compute the generate and
Fig. 6. Sum logic for speculative sum generation. propagate signals for a merged block combining two adjacent blocks.
The timing-driven nature of the proposed design reduces the depth of
the adder. In addition, we use a ripple-carry type of structure in the
nontiming critical portion of the parallel-prefix computation network.
pipeline stage can be added after the 64-bit additions. Fig. 6 shows These techniques help produce a good timing-area tradeoff characteristic.
the sum logic for such a case. The second design is based on [12]. The experimental results indicate that our proposed adder is significantly
To reduce the cycle time, (c0i ; c1i )=(hi0 ; hi1 ) needs to be generated as faster than the popular Brent–Kung adder with some area overhead. On
quickly as possible and SCT2/LSCT2 are the best choices for this appli- the adder hand, the proposed adder also shows marginally faster perfor-
mance than the fast Kogge–Stone adder with significant area savings.
cation because a pipeline design is generally intended for maximizing
the throughput rather than for minimizing the area. Between SCT2 and Index Terms—Arithmetic and logic structures, integrated circuits, logic
design.
LSCT2, LSCT2 is a bit faster as stated in Section II-C.

IV. CONCLUSION I. INTRODUCTION


A formal framework for speculative carry generation is proposed. The complexity and the performance requirement of the datapath op-
The framework is successfully applied to adders using the Ling carry erations implemented in systems-on-chips (SoCs) has increased con-
as well as adders with a normal carry. Including two Ling carry cases, siderably over the years. Since binary adders are one of the most basic
three new speculative prefix schemes are introduced. and widely used arithmetic datapath operations in modern integrated
Several applications for speculative carry generation are presented circuits, they tend to play a critical role in determining the performance
to show how this work broadens the design space of speculative prefix of the design. Hence, developing an efficient adder architecture (from
adders. the standpoint of timing, area, and power) is crucial to improving the
efficiency of the design.
REFERENCES Carry lookahead adders based on parallel prefix computation
[1] N. Burgess, “Prenormalization rounding in IEEE floating-point opera- methods yield the fastest adders. There are several techniques proposed
tions using a flagged prefix adder,” IEEE Trans. Very Large Scale In- for the computation of the parallel prefix. In [1], Sklansky proposes one
tegr. (VLSI) Syst., vol. 13, no. 2, pp. 266–277, Feb. 2005. of the earliest tree-prefix algorithms for adders, where a tree structure
[2] N. Burgess, “The flagged prefix adder and its application in integer
is used to compute the intermediate signals. In the Brent–Kung (BK)
arithmetic,” J. VLSI Signal Process, vol. 31, pp. 263–271, 2002.
[3] P. M. Kogge and H. S. Stone, “A parallel algorithm for the efficient approach [2], Brent and Kung design the prefix-computation graph
solution of a general class of recurrence equations,” IEEE Trans. in an area-optimal way and the Kogge–Stone (KS) architecture [3] is
Comput., vol. C-22, no. 8, pp. 786–793, Aug. 1973.
[4] R. E. Ladner and M. J. Fischer, “Parallel prefix computation,” JACM,
vol. 27, no. 4, pp. 831–838, 1980.
[5] R. P. Brent and H. T. Kung, “A regular layout for parallel adders,” IEEE Manuscript received March 18, 2007; revised June 11, 2007.
Trans. Comput., vol. C-31, no. 3, pp. 260–264, Mar. 1982. S. Das is with Asyst Technologies, Freemont, CA 94538 USA (e-mail:
[6] S. Knowles, “A family of adders,” in Proc. 15th IEEE Symp. Comput. sabya@asyst.com).
Arithmetic, 2001, pp. 277–281. S. P. Khatri is with the Department of Electrical and Computer Engi-
[7] Y. Choi and E. E. Swartzlander, Jr., “Parallel prefix adder design with neering, Texas A&M University, College Station, TX 77843 USA (e-mail:
matrix representation,” in Proc. 17th IEEE Symp. Comput. Arithmetic, sunilkhatri@tamu.edu).
2005, pp. 90–98. Digital Object Identifier 10.1109/TVLSI.2007.915507

1063-8210/$25.00 © 2008 IEEE


IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 3, MARCH 2008 327

optimized for timing. In [4], another prefix-computation architecture


is proposed, where the fan-out of gates increases with the depth of the
prefix computation tree. In [5], a hybrid adder architecture based on
BK and KS is proposed. In [6], a zero-deficiency prefix adder with
minimal depth was introduced. In [7] and [8], the authors present new
algorithms to construct a class of depth-size optimal parallel prefix
circuits. In [9], a parallel prefix adder synthesis was introduced, which
performs two-step area minimization under given timing constraints.
In [10], Choi and Swartzlander present a one-shot batch process that
generates a wide range of designs for a group of parallel prefix adders.
In [11], Dimitrakopoulos and Nikolos save one-logic level of imple- Fig. 1. Block-diagrams of “o” (carry) and “o3” (triple-carry) operators.
mentation leading to faster performance of the parallel-prefix addition.
In [12], a performance evaluation analysis was performed between
flagged prefix adders with the other well-known prefix adders. In [13],
propagate and generate of individual bits is applicable to blocks of ad-
Liu et al. propose an algorithmic approach to generate an irregular
jacent bits also. The propagate and generate value-pairs of these two
parallel-prefix adder. In [14], Lin et al. use domino logic to generate an
blocks are referred to as (gi;j +1 ; pi;j +1 ) and (gj;k ; pj;k ), respectively.
efficient parallel-prefix architecture. Our approach is different from all
In this paper, we denote these pairs of generate and propagate values as
the other approaches mentioned earlier, because we use combination
GPi;j +1 and GPj;k . If the block consists of only one bit, then to rep-
of two types of merged blocks.
resent the value pair of (gi ; pi ), we use the notation of GPi (instead of
In this paper, we propose a new design of an efficient addition block
GPi;i ). Now, if we combine these two adjacent blocks to form a single

continuous block having (i 0 k + 1) bits, the equations for computing


based on the parallel-prefix computation technique. In our approach,
we use the notion of computing the generate and propagate signals for
the generate and propagate values of the combined block is as follows:
a merged block combining three adjacent blocks. We use this in con-
junction with the classic approach of computing generate and propa-
gate signals for a merged block combining two adjacent blocks. Our
design is timing driven in the timing critical path. At the same time, we
gi;k = gi;j +1 _( pi;j +1 ^ gj;k ) (4)
optimize for area in the nontiming critical path. This is another novel pi;k = pi;j +1 ^pj;k : (5)
aspect of our proposed approach.
We have organized the rest of this paper as follows. In Section II, The final output of a parallel prefix computation tree is the set of
we present some background information about the parallel-prefix ar- all the (gi;0 ; pi;0 ) value pairs (for i = 0; 1; . . . ; (n 0 1)). For a two-
chitecture. In Section III, we discuss our proposed approach in detail. operand addition block, the value of the signal gi;0 at every bit is equal
Section IV presents the experimental results. Conclusions are drawn in to the value of the signal carryi+1 (for i = 0; 1; . . . ; (n 0 1)).
Section V. The Brent and Kung adder and [2] the Kogge and Stone adder [3] use
the “o” operator, which performs the computation described in (4) and
(5) (for any given generate and propagate value pairs (gi;j +1 ; pi;j +1 )
II. PRELIMINARIES and (gj;k ; pj;k )). The block diagram of the “o” operator is shown in
Fig. 1(a).
In this section, we briefly explain the concept of the carry looka-
head adder and the parallel-prefix network, using the example of a
two-operand (a and b) addition block. III. OUR APPROACH
In every bit (i) of the two-operand adder block, the two input signals
(ai and bi ) are added to the corresponding carry-in signal (carryi ) to Throughout the rest of this paper, we assume two operands (a and b)
produce the sum output (sumi ). of the adder are n-bit wide, and the output (sum) of the adder is (n +
1)-bit wide. In our approach, we compute the generate and propagate
The equation to produce the sum output is:
signals for each of the individual bits by using the logic presented in
(2) and (3). After computing all the GPi values (gi ; pi ) for each of
the individual bits (i = 0; 1; 2; . . . ; (n 0 1)), these get transmitted to
sumi = ai 8 8 carry
bi i: (1)
the proposed parallel-prefix carry computation tree, described in the
following.
Computation of the carry-in signals at every bit is the most crit- We define the notion of computing the generate and propagate sig-
ical and time-consuming operation. In the carry-lookahead scheme of nals for a merged block comprising three adjacent blocks. Let Bi;j +1 ,
adders, the focus is to design a circuit which can efficiently compute Bj;k+1 , and Bk;l be three adjacent blocks in an adder module. These
the (n 0 1) carry-in signals (c1 to cn ) based on the 2n input bits blocks consist of (i 0 j ), (j 0 k), and (k 0 l + 1) bits, respec-
(a0 ; a1 ; . . . ; an01 and b0 ; b1 ; . . . ; bn01 ). For any given bit-position, tively. In addition, suppose that Bi;j +1 consists of more significant
the generate (gi ) and propagate (pi ) signals are defined as follows: bits than Bj;k+1 , and Bj;k+1 consists of more significant bits than
Bk;l . The propagate and generate value pairs of these three blocks are

(gi;j +1 ; pi;j +1 ), (gj;k+1 ; pj;k+1 ), and (gk;l ; pk;l ), respectively. Now,


gi = ai ^ bi (2)
if we combine these three adjacent blocks to form a single continuous
pi = ai 8 bi : (3) block having (i 0 l +1) bits, then the combined block (Bi;l ) propagates
the carry only if each of the three blocks (Bi;j +1 , Bj;k+1 , and Bk;l )
The key idea behind the parallel prefix computation is as follows. propagates the carry. On the other hand, the combined block (Bi;l )
Let Bi;j +1 and Bj;k be two adjacent blocks in an adder module. generates carry in the following three situations.
These two blocks consist of (i 0 j ) and (j 0 k + 1) bits, respectively, • If the block Bi;j +1 generates a carry. In other words, if (gi;j +1 =
and Bi;j +1 consists of more significant bits than Bj;k . The concept of 1).
328 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 3, MARCH 2008

Fig. 3. Our proposed parallel prefix network (for input width of 24 bits).

we mostly perform the traditional carry operation with the “o” operator
Fig. 2. Our proposed parallel prefix network (for input width of 16 bits). (with some “o3” operators as well). In levels lower than 1, we perform
timing-driven optimization and use a combination of the two types of
operators. To avoid problems due to high fanout nets, we restrict the
• If the block Bj;k+1 generates a carry and the block Bi;j +1 propa- maximum fanout of any net to 5. To maintain this strict limit on fanouts,
gates that carry. In other words, if (gj;k+1 = 1) and (pi;j +1 = 1). we use the triple-carry operators quite aggressively in the bits near the
• If the block Bk;l generates a carry and the blocks Bi;j +1 , Bj;k+1 most significant bit. We note that, in most of the parallel prefix compu-
propagates that carry. In other words, if (gk;l = 1), (pi;j +1 = 1), tation tree designs, the critical paths primarily go through the outputs,
and (pj;k+1 = 1). which are placed near the most significant bit (n 0 1). Hence, we try
The equations for computing the generate and propagate values of to instantiate triple-carry operators in the paths which go through the
the combined block are as follows: critical pins. This reduces the depth along those paths (at the expense
of additional hardware) and improves the performance of the parallel
gi;l = gi;j +1 _ (pi;j +1 ^ gj;k+1 ) _ (pi;j +1 ^ pj;k+1 ^ gk;l )
prefix block. On the other hand, we also note that bits near the least
(6) significant bit typically have positive slack. To exploit this fact and to
pi;l = pi;j +1 ^ pj;k+1 ^ pk;l : (7) perform area reduction, we use a ripple type structure in that part of the
design (without impacting the overall performance of the block). As a
We denote the previous expressions [in (6) and (7)] as the “o3” oper-
result, we claim that our design is timing driven in the timing critical
ator (or the triple-carry operator), which takes three pairs of generate
paths and area driven in the nontiming critical (and area critical) paths.
and propagate values as inputs and produces the generate and propa-
Note that we do not extend the concept of “o3” operators to com-
gate values of the combined large block. The block diagram of the “o3”
bine four adjacent blocks to form a single block (“o4” block). This is
operator (triple-carry-operator) is shown in Fig. 1(b).
because an “o4” operator is a combination of two levels of “o” opera-
We use this in conjunction with the classic approach of computing
tors (total of three “o” operators) arranged in a tree-like fashion. Hence,
generate and propagate signals for a merged block by combining two
unlike the usage of “o3” operator, usage of “o4” operator is not an ar-
adjacent blocks [as explained in the (4) and (5)]. This is called the “o”
chitectural optimization. Depending on the availability of the cells in
operator. The block diagram of the “o” operator is shown in Fig. 1(a).
the technology library, a high-quality technology mapping algorithm in
The key idea in our approach is to find opportunities to use the
commercial logic synthesis tools should be able to efficiently use the
triple-carry-operator (“o3” operator). By analyzing several technology
cells required for the “o4” operator.
libraries provided by commercial vendors, we have found that the worst
delay through a triple-carry operator is between 110% and 130% of the
traditional carry operator. Since the “o3” operator produces the gen- IV. EXPERIMENTAL RESULTS
erate and propagate value pair by combining three blocks as opposed To collect different data points regarding the quality of results for the
to two blocks in the traditional carry operator, the additional 10% to adder blocks, we used the following variations.
30% of delay is well justified. Since the “o3” operator processes one • Adder blocks of different input widths:
additional block compared to the “o” operator, it reduces the depth — We have used adders having different input widths. In Table I,
of the parallel-prefix network. The area of the “o3” operator is 50% to we have shown the final results for adders having input bit-
80% more than the area of the “o” operator, hence, we only use the widths (n) equal to 16, 24, 32, 48, and 64 bits. We refer to
“o3” operator in the timing critical portion of the parallel prefix tree. these blocks as Adder-16, Adder-24, Adder-32, Adder-48, and
This delay characteristic makes triple-carry operator an efficient choice Adder-64, respectively.
in the parallel prefix network. • Different technologies and libraries:
The block diagram of a 16-bit wide proposed parallel prefix graph — two commercial libraries (L1 and L2 ) for 0.13 ;
(of an adder block) is shown in Fig. 2. In addition, Fig. 3 represents — two commercial libraries (L3 and L4 ) for 0.09 .
the block-diagram of a 24-bit wide proposed parallel prefix computa- • Different input arrival time constraints:
tion network. Since the carryi+1 is equal to gi;0 , we label all the out- We used the following input arrival time constraints.
puts in Figs. 2 and 3 as Ci (for each value of i). Due to the lack of — Different input bits of signals a and b arrive at different times.
space in the diagrams, Ci is used as an abbreviation instead of carryi . The motivation for this is as follows. There exists an adder
Both the diagrams are drawn in a levelized fashion. Let us assume that sub-block inside every arithmetic sum-of-product (SOP) and
the inputs to the parallel prefix tree (GPi ) are at level (or depth) 0, multiplier block. Due to the wide usage of SOP and multipliers
shown at the top of Figs. 2 and 3. As we proceed downwards in Figs. 2 in the modern digital designs, the performance of this adder
and 3, the level (or depth) increases by one. In these designs, at level block is crucial to determine the performance of the design.
1, we initially use a large number of “o3” operators. We instantiate Thus, we model this timing constraint [15]. Since an adder
triple-carry operators to combine every GP3p , GP3p+1 , and GP3p+2 is an internal part of a SOP and multiplier block, the arrival
(until each GPi participates in an operator). Then, in the second level, times of different inputs of the adder block are not identical.
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 3, MARCH 2008 329

TABLE I
DELAY AND AREA COMPARISON OF ADDER BLOCKS GENERATED BY BK, KS, AND OUR APPROACH

Hence, we cannot directly write timing constraints to control We have implemented the BK adder [2], the KS adder [3], and our
the arrival times for the inputs of the adder. As a result, we proposed adder for different operand widths. We optimized each of
specified the arrival time constraints for the inputs of the SOP the architectures by using a best-in-class commercially available data-
and the multiplier. Once the input arrival times are specified path synthesis tool (run on a workstation with dual 2.2-GHz proces-
for SOPs and multipliers, the synthesis tool propagates the ar- sors, 4 GB memory, and RedHat 7.1 Linux). The synthesis tool per-
rival times through each sub-block inside the SOP and mul- formed the operations like technology-independent optimizations, con-
tiplier. We then report the actual arrival-time numbers to the stant propagation, redundancy removal, technology mapping, timing-
input of the adder sub-block inside SOP and multiplier. In driven optimization, area-driven optimization, incremental optimiza-
this manner, we collected significant amount of data on the ar- tion, etc. Due to the licensing agreements, we are unable to mention
rival-times of the adder inputs. From this arrival-time data, we the name of the commercial tool we used. In Table I, we present the
derived the following equation. We believe that this equation post-synthesis worst-case delay and the total area results for the adder
closely represents the actual arrival timing-constraint for the block for each of the three architectures (as reported by the synthesis
adder sub-blocks inside real-life SOPs and multiplier blocks. tool). To compute worst-case delay, the static timing computation en-
We refer to this category of timing constraints as Arr(late). gine inside the datapath synthesis tool was used. To compute total area,
Let us denote Arr(ai ) as the arrival time of the signal ai . As- the technology library cell information was used.
suming that k is a constant and  is the delay of the fastest In Table I, we report 25 sets of data points for adders of different
two-input AND gate in the technology library, the following is widths, timing constraints, and technology libraries. On an average,
the Arr(late) timing constraint (n is the width of the adder in- our proposed approach results in a 23.96% faster adder (column 7 of
puts): Table I), with 9.39% area penalty (colum 12). When comparing with the
KS adder, then our proposed approach results in a marginally (0.77%)
faster implementation (column 8), with a significant (29.71%) area im-
Arr(ai ) = ik ; 0 i  d3n=5e provement (column 13). Note that like the BK and KS approaches, our
Arr(a ) = d3n=5ek 0 (i 0 d3n=5e) k ; d(3n=5)e < i < n
i approach generates the same structure irrespective of the input arrival
Arr(b ) = ik ; 0  i  d3n=5e
timing constraints. Then, depending on the arrival timing constraint,
i
the technology mapping algorithms will choose different technology
Arr(b ) = d3n=5ek 0 (i 0 d3n=5e) k ; d(3n=5)e < i < n:
i cells to yield different final worst delay (and area) numbers.
To verify the correlation of post-synthesis experimental data with
— All input bits of the signals a and b arrive at the same time. We the post place-and-route data, we performed placement and routing on
refer to this constraint as Arr(same). If k is a constant number, one Adder-32 and one Adder-64 design. For these two testcases, the
then the Arr(same) constraint can be represented as average post-routing worst delay of BK adder, KS adder, and our pro-
posed adder are (normalized to the worst delay of the BK adder): 1.0,
0.78, and 0.76, respectively. Similarly, the post-routing total area of
Arr(ai ) = k ; 0 i<n the BK adder, KS adder, and our proposed adder are (normalized to
Arr(bi ) = k ; 0  i < n: the area of the BK adder): 1.0, 1.34, and 1.07, respectively. The indi-
330 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 3, MARCH 2008

TABLE II
LEAKAGE AND DYNAMIC POWER COMPARISON OF ADDER BLOCKS GENERATED BY BK, KS, AND OUR APPROACH

vidual results for the Adder-32 and Adder-64 designs correlate closely given situation for the given adder block, the synthesis tool will auto-
with the post-synthesis numbers reported in Table I. These results after matically select this architecture without the designer doing anything
place and route confirm our conclusion about the efficient timing area special.
characteristic of our approach.
For the reference purposes, we implemented the ripple adder and
measured its delay and area numbers across all our adder designs, li- V. CONCLUSION
braries, and timing constraints. The experimental data showed that, on
In this paper, we have presented a hybrid approach of implementing
an average, our proposed adder is about 62% faster and 239% larger an adder block based on the fast parallel prefix architecture. The pro-
than the ripple adder.
posed adder exhibits very efficient timing area tradeoff characteristics.
We also performed some additional experimentation by using dif-
Our hybrid architecture is based on the triple-carry operator (“o3”)
ferent values of  in the equation for Arr(late). The modified values and the classical carry-operator (“o”). It works seamlessly with adder
of  we tried are equal to: 1) a two-input XOR gate delay from the tech-
blocks of different widths and across different technology domains
nology library; 2) a two-input OR gate delay; 3) an inverter gate delay;
(0.13 , 0.09 , etc.). The experimental results indicate that our pro-
4) 1 (constant number). In each of these cases, the resulting delay and
posed adder is significantly faster than the popular BK adder with some
area numbers of our adders exhibit substantially same timing area char-
area overhead. On the adder hand, the proposed adder also shows mar-
acteristics as reported in Table I. ginally faster performance than the fast KS adder with significant area
In Table II, we present the post-synthesis leakage and dynamic power
savings.
results for the adder block for each of the three architectures (as re-
ported by the synthesis tool). By analyzing the result, we note that the
power consumption of our adder is more than that of the BK adders, REFERENCES
but significantly less than that of the KS adder.
State-of-the-art designs need to be designed while considering dif- [1] J. Sklansky, “Conditional sum addition logic,” IRE Trans. Electron.
Comput., vol. EC-9, no. 6, pp. 226–231, 1960.
ferent cost metrics (timing, area, power, etc.). The efficient timing-area [2] R. P. Brent and H. T. Kung, “A regular layout for parallel adders,” IEEE
characteristics of our proposed adder design (over the well-known BK Trans. Comput., vol. 31, no. 3, pp. 260–264, Mar. 1982.
and KS adders) is consistent across multiple sizes of adders, timing [3] P. M. Kogge and H. S. Stone, “A parallel algorithm for the efficient
constraints and technology libraries. This underscores the utility and solution of a general class of recurrence equations,” IEEE Trans.
Comput., vol. C-22, no. 8, pp. 783–791, Aug. 1973.
scalability of our design. Since addition is a frequently used part of
[4] R. E. Ladner and M. J. Fischer, “Parallel prefix computation,” J. ACM,
many critical operations in an IC, we believe that many real-life de- vol. 27, no. 4, pp. 831–838, 1980.
signs can significantly benefit from our proposed architecture. [5] T. Han and D. A. Carlson, “Fast area-efficient VLSI adders,” in Proc.
To achieve the proposed architecture’s benefit, the designer does not 8th Symp. Comput. Arithmetic, 1987, pp. 49–56.
require to perform any extra task. Typically a state of the art datapath [6] H. Zhu, C. K. Cheng, and R. Graham, “On the construction of zero-
deficiency parallel prefix circuits with minimum depth,” ACM Trans.
synthesis tool has multiple architectures available for adder and it se- Des. Autom. Electron. Syst., vol. 11, no. 2, pp. 387–409, 2006.
lects the appropriate one depending on the timing constraint, library, [7] Y. C. Lin and C. C. Shih, “A new class of depth-size optimal parallel
the design, etc. As a result, when this architecture is the best in the prefix circuits,” J. Supercomput., vol. 14, no. 1, pp. 39–52, 1999.
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 3, MARCH 2008 331

[8] Y. C. Lin and C. Y. Su, “Faster optimal parallel prefix circuits: New
algorithmic construction,” J. Parallel Distrib. Comput., vol. 65, no. 12,
pp. 1585–1595, 2005.
[9] T. Matsunaga and Y. Matsunaga, “Area minimization algorithm for
parallel prefix adders under bitwise delay constraints,” in Proc. 17th
Great Lakes Symp. VLSI, 2007, pp. 435–440.
[10] Y. Choi and E. E. Swartzlander, Jr, “Parallel prefix adder design with
matrix representation,” in Proc. 17th IEEE Symp. Comput. Arithmetic
(ARITH), 2005, pp. 90–98.
[11] G. Dimitrakopoulos and D. Nikolos, “High-speed parallel-prefix VLSI
ling adders,” IEEE Trans. Comput., vol. 54, no. 2, pp. 225–231, Feb.
2005.
[12] V. Dave, E. Oruklu, and J. Saniie, “Performance evaluation of flagged
prefix adders for constant addition,” in Proc. IEEE Int. Conf. Electro/
inf. Technol., 2006, pp. 415–420. Fig. 1. Conventional CAM architecture.
[13] J. Liu, S. Zhou, H. Zhu, and C. K. Cheng, “An algorithmic approach for
generic parallel adders,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided
Des., 2003, pp. 734–740.
[14] R. Lin, K. Nakano, S. Olariu, and A. Y. Zomaya, “An efficient parallel compares the input search data with the stored data. Once matching
prefix sums architecture with domino logic,” IEEE Trans. Parallel Dis- data are found, their addresses are returned as output as shown in Fig. 1.
trib. Syst., vol. 14, no. 9, pp. 922–931, Sep. 2003.
[15] P. F. Stelling and V. G. Oklobdzija, “Design strategies for optimal hy- The vast number of comparison operations required by CAMs con-
brid final adders in a parallel multiplier,” J. VLSI Signal Process., vol. sumes a large amount of power.
14, no. 3, pp. 321–331, 1996. In the past decade, much research on energy reduction has focused
on the circuit and technology domains ([1] provides a comprehensive
survey on CAM designs from circuit to architectural levels). Several
Low Power Design of Precomputation-Based works on reducing CAM power consumption have focused on reducing
match-line power [2]–[4]. Although there has been progress in this area
Content-Addressable Memory
in recent years, the power consumption of CAMs is still high compared
Shanq-Jang Ruan, Chi-Yu Wu, and Jui-Yuan Hsieh with RAMs of similar size. At the same time, research in associative
cache system design for power efficiency at the architectural level con-
tinues to increase. The filter cache [5], [6] and location cache tech-
Abstract—Content-addressable memory (CAM) is frequently used in ap- niques [7] can effectively reduce the power dissipation by adding a very
plications, such as lookup tables, databases, associative computing, and net- small cache. However, the use of these caches requires major modifica-
working, that require high-speed searches due to its ability to improve ap- tions to the memory structure and hierarchy to fit the design. Pagiamtzis
plication performance by using parallel comparison to reduce search time. et al. proposed a cache-CAM (C-CAM) that reduces power consump-
Although the use of parallel comparison results in reduced search time, it
tion relative to the cache hit rate [8]. Lin et al. presented a ones-count
also significantly increases power consumption. In this paper, we propose
a Block-XOR approach to improve the efficiency of low power precompu- precomputation-based CAM (PB-CAM) that achieves low-power, low-
tation-based CAM (PB-CAM). Through mathematical analysis, we found cost, low-voltage, and high-reliability features [9]. Although Cheng
that our approach can effectively reduce the number of comparison oper- [10] further improved the efficiency of PB-CAMs, the approach pro-
ations by 50% on average as compared with the ones-count approach for posed requires considerable modification to the memory architecture
32-bit-long inputs. In our experiment, we used Synopsys Nanosim to esti-
mate the power consumption in TSMC 0.35- m CMOS technology. Com- to achieve high performance. Therefore, it is beyond the scope of the
pared with the ones-count PB-CAM system, the experimental results show general CAM design. Moreover, the disadvantage of the ones count
that our proposed approach can achieve on average 30% in power reduc- PB-CAM system [9] is that it adopts a special memory cell design
tion and 32% in power performance reduction. The major contribution of for reducing power consumption, which is only applicable to the ones-
this paper is that it presents theoretical and practical proofs to verify that
count parameter extractor.
our proposed Block-XOR PB-CAM system can achieve greater power re-
duction without the need for a special CAM cell design. This implies that In this paper, we present a Block-XOR approach for reducing com-
our approach is more flexible and adaptive for general designs. parison operations in the second part for the PB-CAM. Our approach
Index Terms—Content-addressable memory (CAM), low-power, pre- employs a brand new parameter extractor, which can better reduce the
computation. comparison operations required than the ones-count approach [9]. Our
approach reduces power consumption by reducing comparison opera-
tions.
I. INTRODUCTION The remainder of this paper is organized as follows. In Section II,
A content-addressable memory (CAM) is a critical device for appli- we briefly describe the PB-CAM architecture. Our new architecture is
cations involving asynchronous transfer mode (ATM), communication described in Section III, where the design of our Block-XOR parameter
networks, LAN bridges/switches, databases, lookup tables, and tag di- extractor is provided and we exploit mathematical analysis to prove the
rectories, due to its high-speed data search capability. A CAM is a func- effectiveness of our proposed architecture. In Section IV, the experi-
tional memory with a large amount of stored data that simultaneously mental results are provided to further verify our mathematical analysis.
In addition, we also give a comprehensive comparison between [9] and
our approach. Finally, we give a conclusion in Section V.
Manuscript received March 27, 2006; revised March 12, 2007, April 9, 2007,
and June 14, 2007.
The authors are with the Department of Electronic Engineering, National II. PREVIOUS WORK AND OBSERVATION
Taiwan University of Science and Technology, Taipei 106, Taiwan, R.O.C.
(e-mail: sjruan@mail.ntust.edu.tw). To understand our approach more clearly, we need to briefly describe
Digital Object Identifier 10.1109/TVLSI.2007.915514 the architecture of the PB-CAM proposed in [9].

1063-8210/$25.00 © 2008 IEEE

You might also like