Algorithm and Architecture For Logarithm, Exponential, and Powering Computation

Download as pdf or txt
Download as pdf or txt
You are on page 1of 12

IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO.

9, SEPTEMBER 2004 1085

Algorithm and Architecture for Logarithm,


Exponential, and Powering Computation
J.-A. Piñeiro, M.D. Ercegovac, Fellow, IEEE, and J.D. Bruguera, Member, IEEE

Abstract—An architecture for the computation of logarithm, exponential, and powering operations is presented in this paper, based on
a high-radix composite algorithm for the computation of the powering function (XY ). The algorithm consists of a sequence of
overlapped operations: 1) digit-recurrence logarithm, 2) left-to-right carry-free (LRCF) multiplication, and 3) online exponential. A
redundant number system is used and the selection in 1) and 3) is done by rounding except from the first iteration, when selection by
table look-up is necessary to guarantee the convergence of the recurrences. A sequential implementation of the algorithm, with a
control unit which allows the independent computation of logarithm and exponential, is proposed and the execution times and
hardware requirements are estimated for single and double-precision floating-point computations. These estimates are obtained for
radices from r ¼ 8 to r ¼ 1; 024, according to an approximate model for the delay and area of the main logic blocks and help
determining the radix values which lead to the most efficient implementations: r ¼ 32 and r ¼ 128.

Index Terms—Computer arithmetic, elementary function approximation, digit-recurrence algorithms, logarithm, exponential,
powering, high-radix, LRCF multiplication, online arithmetic.

1 INTRODUCTION
(XY ), logarithm (log X), and exponential (2X )
P OWERING
are important operations in computer 3D graphics,
digital signal processing (DSP), scientific computing,
a digit-recurrence algorithm for powering computation is not
feasible due to its high intrinsic complexity.
In this paper, we give a detailed description of an
artificial neural networks, logarithmic number systems optimized composite iterative algorithm for the computa-
(LNS), and multimedia applications [12], [16], [21], [26], tion of the powering function (XY ), for a floating-point
[29]. As other elementary functions, such as square root, input operand X ¼ Mx 2Ex and integer1 b-bit operand Y ,
reciprocal square root, and trigonometric functions, they and extend it to powering operations with exponents of the
have been traditionally computed by software routines [3], type Y ¼ 1=q, with q integer. An abbreviated previous
version of our algorithm with integer exponent was
[9], [10], [13]. These routines provide very accurate results,
presented in [23]. The final result, XY , was computed as
but are often too slow for numerically intensive or real-time
Z ¼ Mz 2Ez ¼ eY lnðMz Þ 2Y Ez through a sequence of overlapped
applications. The timing constraints of these applications
operations. The first step consisted of computing lnðMz Þ by
have led to the development of dedicated hardware for the
using a high-radix (r ¼ 2b ) digit-recurrence algorithm with
computation of elementary functions, including the im- selection by rounding [22]. An intermediate computation
plementation of table-based algorithms [17], [25], [27], [28], YL lnðMz Þ, with YL ¼ Y log2 ðeÞ, was carried out by a high-
functional iteration methods [11], [18], [20], and digit- radix left-to-right carry-free (LRCF) multiplication opera-
recurrence algorithms [2], [7]. tion [5]. Another LRCF multiplication by ln 2 was per-
Accurately computing the floating-point powering func- formed to guarantee the convergence of the algorithm and,
tion is considered difficult [17] and the prohibitive hardware as the last step, the exponential of the resulting product was
computed by an online high-radix algorithm, with online
requirements of a table-based implementation (note that XY
delay  ¼ 2 and selection by rounding [24].
is a 2-variable function) have led only to partial solutions, In the optimized version of our algorithm, the final result
such as powering algorithms for a constant exponent p [21], XY can be computed directly as Z ¼ Mz 2Ez ¼ 2Y log2 ðMx Þ 2Y Ex ,
which allows avoiding the computation of YL ¼ Y log2 ðeÞ
[27] or for very low precision [12]. A direct implementation of and eliminates the need for a second LRCF multiplication,
reducing the overall latency by one cycle. The expressions
. J.-A. Piñeiro was with the University of Santiago de Compostela and is of the recurrences in the computations of the logarithm and
now with Intel Barcelona Research Center, Edif. Vertex II, c) Jordi Girona, the exponential must be slightly modified and extra look-up
29-3A, 08034 Barcelona, Spain. E-mail: alex@dec.usc.es. tables storing lj = lnð2Þ and ej = lnð2Þ are now required.
. M.D. Ercegovac is with the Department of Computer Science, University
of California at Los Angeles, 4732E Boelter Hall, Los Angeles, CA 90095. In the stages computing such operations (logarithm and
E-mail: milos@cs.ucla.edu. exponential), selection by table look-up is still performed in
. J.D. Bruguera is with the Department of Electronic and Computer the first iteration to guarantee the convergence of both
Engineering, University of Santiago de Compostela, 15782, Santiago de
Compostela, Spain. E-mail: bruguera@dec.usc.es.
1. The computation of the powering operation with integer exponent in
Manuscript received 7 May 2003; revised 19 Dec. 2003; accepted 29 Jan. 2003. the range ½1; 128 is useful, for instance, in computer graphics lighting
For information on obtaining reprints of this article, please send e-mail to: computations (the lighting specular component in the OpenGL graphics
tc@computer.org, and reference IEEECS Log Number TC-0033-0503. pipeline [12]).
0018-9340/04/$20.00 ß 2004 IEEE Published by the IEEE Computer Society
Authorized licensed use limited to: Univ of Calif Los Angeles. Downloaded on December 8, 2008 at 00:51 from IEEE Xplore. Restrictions apply.
1086 IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 9, SEPTEMBER 2004

algorithms, with the online delay of two cycles in the XY ¼ 2Y log2 ðXÞ : ð1Þ
exponential scheme allowing addressing of the initial tables 3
Considering a floating-point input operand X ¼ Mx 2Ex ,
one cycle in advance and reducing the delay of the critical with Mx the n-bit significand and Ex the exponent, and an
path in this stage. integer b-bit4 input operand Y :
A sequential architecture for powering computation was
Ex Ex
proposed in [23], with radix r ¼ 128. Such an architecture XY ¼ 2Y log2 ðMx 2 Þ
¼ 2Y ðlog2 ðMx Þþlog2 ð2 ÞÞ

can actually be implemented with any radix r  8, although Ex Y Ex

the overall hardware requirements increase with r and an ¼ 2Y log2 ðMx Þ 2Y log2 ð2 Þ
¼ 2Y log2 ðMx Þ 2log2 ð2 Þ ð2Þ
Y log2 ðMx Þ Y Ex
analysis of the trade offs between area and speed is ¼2 2 :
necessary for determining which radix values result in the
most efficient implementations. We perform such an According to (2), the powering function can be calculated
analysis in this paper for our optimized algorithm.2 by a sequence of operations consisting of the logarithm of
The analysis performed is based on estimates obtained the significand Mx , a multiplication by Y , and the
for single and double-precision computations and for radix exponential of the resulting product. The form of the result
values going from r ¼ 8 to r ¼ 1; 024, according to an is a significand 2Y log2 ðMx Þ and an exponent Y Ex , typical of a
approximate model for the delay and area cost of the main floating-point operand.5
logic blocks employed in the proposed architecture. The For an efficient implementation of the powering func-
main results of our analysis are that a fast implementation tion, the computation of the operations involved must be
can be obtained when using r ¼ 128, but an implementation overlapped, which requires a left-to-right most-significant
with r ¼ 32 is more suitable for applications with tighter digit first (MSDF) mode of operation and the use of a
area constraints. There is no advantage in using values r > redundant number system.
128 since similar or slower execution times are achieved, A problem of the proposed algorithm is the range of the
with much higher hardware requirements. digit-recurrence exponential, which is ð ln 2; ln 2Þ, while
Since the computations of logarithm and exponential are the argument of the exponential here is Y log2 ðMx Þ, with Y
included in our algorithm for powering computation, some an integer. To extend the range of convergence of this
minor changes can be made to the architecture to allow for algorithm and, thus, guarantee the convergence of the
the independent computation of logarithm and exponential, overall proposed method, we extract the integer I and
with lower latencies than powering, making the architec- fractional part F of the product serially, which requires Y to
ture an interesting alternative when implementing a be a b-bit (or 2b-bit) integer.
dedicated unit for elementary function computation. The exponential becomes:
The rest of this paper is structured as follows: We first
focus on the algorithm for powering computation with 2Y log2 ðMx Þ ¼ 2I 2F ð3Þ
integer exponent Y , giving a detailed description of the and, therefore, according to (2),
algorithm and its error analysis in Section 2 and detailing
the architecture implementing the algorithm in Section 3, XY ¼ 2F 2ðIþY Ex Þ ; ð4Þ
with an overview of the high-radix logarithm, LRCF
multiplication, and online high-radix exponential stages. with I and F the integer and fractional parts of Y log2 ðMx Þ,
The evaluation of the proposed architecture for single and resulting in a bounded argument for the exponential.
double-precision floating-point results, for radix values In summary, as illustrated in Fig. 1 for single-precision
going from r ¼ 8 to r ¼ 1; 024, is presented in Section 4. In computations with r ¼ 128, our algorithm for the computa-
Section 5, we explain the minor modifications to be made to tion of the powering function consists of three steps. The
number of iterations to be performed in each stage will be
the proposed architecture in order to allow for the
explained in Section 2.2.
independent computation of logarithm and exponential.
We show how to extend our algorithm for powering 1. Computing the logarithm of the input significand,
computation to exponents of the type Y ¼ 1=q, with log2 ðMx Þ, by employing a high-radix digit-recur-
q integer, in Section 6. Finally, the main contributions made rence algorithm [22].
in this paper are summarized in Section 7. 2. Computing the product Y log2 ðMx Þ by using a serial
left-to-right carry-free (LRCF) multiplication scheme
2 ALGORITHM FOR POWERING COMPUTATION [5]. The integer I and fractional F parts of
Y log2 ðMx Þ are extracted serially.
In this section, we present a new composite iterative 3. Computing the exponential Mz ¼ 2F by employing
algorithm for the computation of the powering function an online high-radix algorithm [24].
(XY ), for a floating-point operand X and an integer b-bit
The exponent Ez can be computed in parallel by an
exponent Y , describing the algorithm and its error analysis. integer multiply-add unit. The output of the algorithm is
2.1 Overview the floating-point normalized result:
The algorithm is based on the well-known identity 3. The algorithm can be adapted to a normalized fixed-point input
operand X ¼ Mx , 1  Mx < 2, resulting in XY ¼ 2Y log2 ðMx Þ .
2. The exponent Y can be 2b-bit wide (instead of b-bit wide) when a 4. The size of the exponent Y can be set to 2b-bits when the range must be
maximum value of 2b  1 does not meet the application requirements. We extended, for instance, to perform lighting computations if the radix is
have chosen, for our study, duplicating the size of the input operand Y for r < 128.
radix values r < 128, taking into account the range of the integer exponent 5. For fixed-point computations, the value 2Y log2 ðMx Þ must be shifted Y Ex
in the OpenGL specular lighting computations. positions.

Authorized licensed use limited to: Univ of Calif Los Angeles. Downloaded on December 8, 2008 at 00:51 from IEEE Xplore. Restrictions apply.
~ EIRO ET AL.: ALGORITHM AND ARCHITECTURE FOR LOGARITHM, EXPONENTIAL, AND POWERING COMPUTATION
PIN 1087

Fig. 1. Operation flow of the powering algorithm (single-precision computation, r ¼ 128).

XY ¼ Mz 2Ez ¼ 2F 2ðIþY Ex Þ : ð5Þ The difference between the exact result and the obtained
The use of redundancy results in 2 2 ð0:5; 2Þ and, F result must be bounded:
therefore, a normalization of the final result may be
j2F  2F 2LRCF  e j < 2n1 : ð9Þ
necessary. However, the condition F < 0 can be determined
in advance to the last iterations of the exponential and the F
Taking into account that 2 2 ð0:5; 2Þ:
final normalization can be performed with no extra delay.
The overall latency of the algorithm, as shown in Fig. 1, j1  2l Y þm  e 21 j < 2n2 : ð10Þ
can be estimated as
For a precision of n bits and a radix r ¼ 2b , a set of
latency ¼ Ne þ ð þ 1Þ þ 1 þ ; ð6Þ minimum values for l , m , and e must be determined to
guarantee a final result accurate to n bits. The values of the
with Ne the number of iterations of the online exponential, error parameters set the precision to be reached in each
ð þ 1Þ cycles to accommodate the online delay, one cycle stage, nl , ne , and nm . These parameters determine a
due to the LRCF multiplication, and  extra cycles due to minimum number of iterations of the logarithm (Nl ) and
the computation of the integer part I of Y log2 ðMx Þ. I can be the exponential (Ne ) to be performed. As shown in Fig. 1,
obtained in a single cycle ( ¼ 1) when Y is a b-bit integer, the number of iterations of the LRCF multiplication to be
while  ¼ 2 extra cycles are necessary when a range extension performed must be the same as those of the logarithm, Nl ,
has been performed by allowing Y to be 2b-bit wide. because all the information must reach the exponential
A latency reduction by one cycle is achieved regarding stage. The parameters Nl , Ne , nl , ne , and nm set the size of
the algorithm proposed in [19], [23] due to the elimination the look-up tables, adders, and multipliers to be used.
of the second LRCF multiplication, required in that case to Moreover, gl , ge , and gm guard bits must be employed to
guarantee the convergence of the algorithm, which was guarantee that, in each stage of the powering algorithm, the
iteration errors do not affect the achievement of the
based on the identity XY ¼ eY ln X instead of XY ¼ 2Y log2 X .
required precisions nl , ne , and nm .
2.2 Error Analysis The critical parameter to be minimized first is e since it
is directly related to the required precision ne to be reached
The final error in the algorithm consists of the accumulation
in the exponential stage and, therefore, to Ne , the number of
of the errors due to the cascaded implementation of a set of
iterations of the online exponential algorithm to be
operations and must be bounded by 2n1 before the final
performed.6
rounding.
As an example, we show in Table 1 the set of minimum
Let l be the error in the computation of the logarithm of
values for the considered parameters for single (n ¼ 24) and
the input significand: lnðMx Þ þ l . When the LRCF multi-
double-precision (n ¼ 53) computations, when the radix is
plication is performed, the associated error is: r ¼ 128. Ne and Nl are given in cycles, and ne , nl , and nm are
LRCF ¼ l Y þ m ; ð7Þ given in number of bits.

with m the error due to the LRCF multiplication scheme in


the computation of the product. 3 IMPLEMENTATION
The integer I and fractional F parts of the product In this section, a sequential architecture is proposed for the
Y log2 ðMx Þ are extracted serially, with no error affecting the implementation of our high-radix powering algorithm. We
integer part. The next operation to be performed is the outline here the main computations involved: 1) high-radix
exponential 2F . The obtained result is: logarithm, 2) high-radix LRCF multiplication, and 3) online
high-radix exponential, and then describe the main features
2F 2LRCF þ e ; ð8Þ
6. Note that Ne is the only parameter that affects the latency of the
with e the error in the computation of the exponential. algorithm since  ¼ 2 for any n and any r  8, as shown in [24].

Authorized licensed use limited to: Univ of Calif Los Angeles. Downloaded on December 8, 2008 at 00:51 from IEEE Xplore. Restrictions apply.
1088 IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 9, SEPTEMBER 2004

TABLE 1
Example of Parameters for Powering Computation (r ¼ 128)

of the logic blocks employed. Fig. 2 shows the block radix and lj is a radix-r digit. This form of fj allows the use
diagram of the proposed architecture. Single thick lines of a shift-and-add implementation.
denote long-word (around n bits) operands/variables in The recurrences for performing such multiplicative
parallel form, single thin lines denote short-word (up to normalization and computing the logarithm digits are [22]:
11 bits) operands/variables in parallel form, and double
Wl ½j þ 1 ¼ rðWl ½j þ lj þ lj Wl ½jrj Þ
lines denote single-digit (b bits) variables (Rj , I, and F j). ð14Þ
R½j þ 1 ¼ rR½j  rj1 log2 ð1 þ lj rj Þ  rRj ;
3.1 High-Radix Logarithm
A high-radix digit-recurrence algorithm for the computa- with j  1, Wl ½1 ¼ rðMx  1Þ, R½1 ¼ 0, and R1 ¼ 0. For a
tion of lnðMx Þ is described in detail in [19], [22]. Some result precision of nl bits, Nl ¼ dnl =be iterations are
modifications and optimizations have been made in the necessary. The scaled recurrence R½j has been defined as
algorithm used here, according to the operation flow in the
R½j ¼ rj2 L½j ð15Þ
optimized algorithm for powering computation, and also a
slightly different notation from that used in [22] is in order to extract a radix-r digit Rj per iteration from the
employed. same bit-positions in all iterations.
The algorithm for the computation of log2 ðMx Þ is based The block diagram of the high-radix logarithm stage is
on the identity shown in Fig. 3, with double lines for SD operands, thin
 Y  X ones for single-digit operands, and thick lines for parallel
log2 ðMx Þ ¼ log2 Mx fj  log2 ðfj Þ: ð11Þ operands. T ABðrl1 Þ and T ABð log l1 Þ are the look-up
tables storing rl1 and  log2 ð1 þ l1 r1 Þ, respectively, ad-
Provided that the following condition is satisfied: dressed by the b þ 1 most significant bits of the input
Y operand Mx .
Mx fj ! 1; ð12Þ The selection of the digits lj in iterations j  2 is done by
rounding an estimate of the residual. Such an estimate is
then
obtained by truncating the signed-digit representation of
X
 log2 ðfj Þ ! log2 ðMx Þ: ð13Þ Wl ½j to t fractional bits. The selection function is
^ l ½jÞ:
lj ¼ roundðW ð16Þ
The condition (12) can be achieved using a multiplicative
normalization, which consists of determining a sequence fj The sign of the digit lj is defined as opposite of the
such that Mx is transformed to 1 by successive multi- sign of Wl ½j in order to satisfy a bound on the residual,
plications. To simplify the multiplications, it is convenient thus assuring the convergence. The digit set for lj is
to define the constants as fj ¼ 1 þ lj rj , where r ¼ 2b is the fðr  1Þ; . . . ; 1; 0; 1; . . . ; ðr  1Þg.
Iteration j ¼ 1 does not converge with selection by round-
ing, so the selection of l1 is performed by table look-up. This
table is addressed by the b þ 1 most significant bits of the
input operand Mx and the selection is done in such a way that
the value of jl2 j is bounded according to the convergence
conditions [22]. However, this results in an overredundant
digit l1 (b þ 1 bits), increasing by one bit the size of the
multiplier operand. The convergence conditions also deter-
mine a minimum value of t ¼ 2 and the radix r  8.
A multiply-add unit is used for the computation of the
residual recurrence, unlike in the algorithm proposed in
[22], where a separated SD multiplier and SDA4 were used.
The digit l1 is stored in the look-up table T ABðrl1 Þ already
in SD-4 recoded form to reduce the delay of the path
containing the table and the multiply-add unit.
The logarithm constants are stored in a look-up table
whose size grows exponentially with the radix. However, an
approximation lj r1 =lnð2Þ can be used in iterations j 
dNl =2e þ 1 in order to reduce the overall hardware require-
Fig. 2. Block diagram of the proposed architecture. ments of the algorithm, with a look-up table storing lj =lnð2Þ.

Authorized licensed use limited to: Univ of Calif Los Angeles. Downloaded on December 8, 2008 at 00:51 from IEEE Xplore. Restrictions apply.
~ EIRO ET AL.: ALGORITHM AND ARCHITECTURE FOR LOGARITHM, EXPONENTIAL, AND POWERING COMPUTATION
PIN 1089

Fig. 3. Block diagram of the high-radix logarithm stage.

The use of redundant representation is mandatory in our LRCF multiplier are used without conversion. The block
algorithm for powering computation due to the left-to-right diagram of the LRCF multiplication stage is shown in Fig. 4.
operation flow and results in faster execution times by The adapted LRCF algorithm has two operands, A and
making the additions independent of the precision. D. A is the radix-2 representation of the operand a such that
P m
a ¼ ni¼1 Ai 2i , with Ai 2 f0; 1g. D is the recoded radix-r
Pnm =b
3.2 LRCF Multiplication representation of the multiplier d such that d ¼ i¼1 Di ri ,
The left-to-right carry-free (LRCF) multiplication, intro- b
with Di 2 fr=2; . . . ; r=2g and r ¼ 2 .
duced in [5], produces the product digits from a redundant The recurrence produces a sequence of two accumulated
set in a most-significant-digit-first (MSDF) manner and products w and p as follows:
performs the conversion on-the-fly of the product generated
in a redundant form to the conventional form without using w½j þ 1 ¼ rðfractionðw½j þ aDjþ1 ÞÞ
a carry-propagate adder and without additional delay. The Kjþ1 ¼ integerðw½j þ aDjþ1 Þ ð17Þ
resulting implementation is fast and regular and is very j1
p½j þ 1 ¼ p½j þ Kjþ1 r ;
well-suited for VLSI implementations [14].
We adapt the LRCF multiplication to carry out the with j ¼ 1; . . . ; nm =b and initial values w½0 ¼ p½0 ¼ 0. The
intermediate multiplication in the high-radix powering, digits of the multiplier are used from most to least
utilizing its left-to-right operation flow, with redundant significant, unlike conventional multiplication schemes
representation of the operands and recoding to a high-radix which use a right-to-left mode. After nm =b steps, p½nm =b
of the multiplier operand. However, since the high-radix is the most significant part of the product, while w½nm =b is
exponential is of the online type, the digits produced by the least significant part.
A fast implementation of the LRCF multiplication
scheme requires: 1) use of a redundant adder (either CS
or SD) to compute the residual w½j and 2), since the
maximum value of Kj in (17) is, in general, larger7 than
ðr  1Þ, a recoding of Kj and Kjþ1 into Pj in the range
½ðr  1Þ; ðr  1Þ is necessary. That is, the resulting digit
will be Pj ¼ fðKj ; Kjþ1 Þ.
Regarding the implementation proposed in [5], the main
modification in the LRCF multiplication stage in the
architecture for powering computation is the use of a
multiply-add unit for computing the recurrences, instead of
separated redundant multiplier and adder.

3.3 Online High-Radix Exponential


An online high-radix algorithm for the computation of eF is
described in detail in [19], [24]. Some modifications and
optimizations have been made in the algorithm used here,
according to the operation flow in the optimized algorithm

Fig. 4. Block diagram of the LRCF multiplication stage. 7. The range of Kj is jKj j < 3r=2.

Authorized licensed use limited to: Univ of Calif Los Angeles. Downloaded on December 8, 2008 at 00:51 from IEEE Xplore. Restrictions apply.
1090 IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 9, SEPTEMBER 2004

Fig. 5. Block diagram of the online high-radix exponential stage.

for powering computation, and also a slightly different ^ e ½jÞ;


ej ¼ roundðW ð21Þ
notation is used.
with a digit set for ej of fðr  1Þ; . . . ; 1; 0; 1; . . . ; ðr  1Þg.
The algorithm for the computation of 2F is based on the ^ e ½j is obtained by truncating the signed-digit representa-
W
identity
Y  P tion of We ½j to t fractional bits.
2X ¼ hj 2X log2 ðhj Þ ; ð18Þ The use of selection by table look-up is required in the
first iteration. The table is addressed by the b most
with hj ¼ 1 þ ej rj , r ¼ 2b being the radix and ej a radix-r significant bits of the input operand F1 and the selection
digit. is performed so that ðr  3Þ  e2  ðr  2Þ, which results
The exponential is computed on the basis of partial in an overredundant first digit e1 . The convergence
information: conditions also determine a minimum value of t ¼ 2, an
online delay  ¼ 2, and a bound on the radix r  8, as
X

F ½0 ¼ Fj rj ; ð19Þ shown in [24].
j¼1
The look-up tables used in the first iteration are
addressed one cycle in advance, since F ½0 is already
with  the online delay and Fj the radix-r digits of the input known while Fþ1 is being computed, and the obtained
operand F . values can be stored in the registers reg e1 and reg log e1 .
The recurrences for performing the additive normalization The conversion from redundant to conventional repre-
of F and computing the exponential are [24]: sentation of the final result is performed using an on-the-fly
We ½j þ 1 ¼ rðWe ½j  rj log2 ð1 þ ej rj Þ þ Fjþ r Þ method [4], [6], avoiding the need for a carry-propagate
ð20Þ addition and an increase in neither the cycle time nor the
E½j þ 1 ¼ E½jð1 þ ej rj Þ; latency of the algorithm.
with j  1, We ½1 ¼ rF ½0, and E½1 ¼ 1. For a result
3.4 Implementation Details
precision of ne bits, a total number of Ne ¼ dne =be iterations
The main features in the implementation of our architecture
are necessary. An approximation ej r= lnð2Þ to the loga-
for powering computation are:
rithm constants can be used in iterations j  dNe =2e þ 1,
with a table storing ej = lnð2Þ, in order to reduce the overall . All variables are in redundant representation
hardware requirements of the algorithm. (signed-digit) to allow faster execution of iterations,
The block diagram of the online high-radix exponential since the additions become independent of the
stage is shown in Fig. 5, with double lines for SD operands, precision. A similar approach could be proposed
thin ones for single-digit operands, and thick lines for with the use of carry-save (CS) representation [15].
parallel operands. T ABðe1 r1 Þ and T ABð log e1 Þ are the . All products of the type Q½jrj and qj rj (with qj ¼
look-up tables storing e1 r1 and r2 log2 ð1 þ e1 r1 Þ, respec- lj or ej ) can be performed as shifts since r ¼ 2b .
tively, addressed by the b most significant bits of the input . SDA is a signed-digit binary adder with  input bit-
operand F . vectors. The addition of two SD operands requires
The selection of digits ej is done in iterations j  2 by an SDA4 adder since the input operands are
rounding an estimate of the residual: represented by two bit-vectors each. We use SDA3

Authorized licensed use limited to: Univ of Calif Los Angeles. Downloaded on December 8, 2008 at 00:51 from IEEE Xplore. Restrictions apply.
~ EIRO ET AL.: ALGORITHM AND ARCHITECTURE FOR LOGARITHM, EXPONENTIAL, AND POWERING COMPUTATION
PIN 1091

TABLE 2
Execution Time and Total Area for Single-Precision Powering (n ¼ 24)

TABLE 3
Execution Time and Total Area for Double-Precision Powering (n ¼ 53)

adders for accumulating a SD operand and an widely used in technology-independent comparisons [8],
operand in two’s complement (2C) representation. [19], [22]. The units employed are the delay  of a complex
. An internal recoding from SD radix-2 to SD radix-4 gate, a full-adder, and the area of a 1-bit full-adder (fa). A
representation of the multiplier operand is per- detailed explanation of the model can be found in [19], [22].
formed in the multiply-add units to reduce by half Tables 2 and 3 show estimates of the latency, cycle time,
the number of partial products to be accumulated. execution time, area of the logarithm stage, area of the
However, when the multiplicand is represented in exponential stage, and total area of the proposed architec-
SD, the size of the unit is roughly double that of a ture, for single and double-precision computations, respec-
regular multiplier with nonredundant multiplicand tively, when implemented with radix values going from
and one extra level in the accumulation tree is r ¼ 8 to r ¼ 1; 024. The latency is given in cycles, the cycle
required. time and execution time are expressed in terms of , and the
. The round&rec units take the ðb þ tÞ most significant unit for the area estimates is fa. The dependence of both the
bits of the residuals Wl ½j and We ½j (t ¼ 2 in both execution times and the total area with the value of the
cases) and produce an SD-4 representation of the radix r is graphically represented in Fig. 6.
digits lj and ej , to be used as multipliers in the The latency, as explained in (6), is Ne þ  þ 4 cycles, with
multiply-add units.  ¼ 1 (with r  128 and  ¼ 2 cycles for lower radices) since
. The round&assim units take the same ðb þ tÞ input the online delay of the exponential stage is  ¼ 2 in this case.
words, but produce a two’s complement representa- The cycle time corresponds in most cases to the high-radix
tion of the coefficient with opposite sign (lj or ej ) logarithm stage, where the critical path can be either the one
by rounding the input word and performing its composed of the initial table look-up for selecting the digit
assimilation into nonredundant representation. This l1 , a multiplexer, the multiply-add unit and the register
2C representation is used for addressing the look-up Wl ½j, or the one consisting of the round&assim unit, the
tables storing the logarithm constants and as inputs look-up table storing the elementary logarithms, a 3:1 multi-
for the tables storing lj = lnð2Þ or ej = lnð2Þ. plexer, an SDA3 adder, and the register R½j.
The total area of our architecture consists of the
4 EVALUATION AND COMPARISON hardware requirements of the individual stages (high-radix
logarithm, LRCF multiplication, and online high-radix
In this section, we present estimates of the execution time
exponential, shown in Figs. 3, 4, and 5), plus the integer
and the area costs of the proposed architecture, for single
ðb  nexp Þ-bit multiply-add unit to compute the exponent
and double-precision floating-point powering computa- Ez , the control logic, and a b-bit CPA adder to assimilate F1
tions (n ¼ 24 and n ¼ 53 bits) with radix values r ¼ 2b in into conventional representation before addressing the
the range from r ¼ 8 to r ¼ 1; 024. These estimates are based initial tables in the online high-radix exponential stage.
on an approximate model for the cost and delay of the main The analysis of the obtained results can be summarized
logic blocks used. in the following points:
The actual delays and area costs depend on the
technology used and on the actual implementation. How- . For radix values r  64, the cycle time only depends
ever, this model provides a good first-order approximation on the radix, and not on the target precision, due to
to the actual execution time and area values and has been the use of redundant arithmetic.

Authorized licensed use limited to: Univ of Calif Los Angeles. Downloaded on December 8, 2008 at 00:51 from IEEE Xplore. Restrictions apply.
1092 IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 9, SEPTEMBER 2004

Fig. 6. Execution time and area cost estimates for the proposed architecture. (a) Single-precision powering. (b) Double-precision powering.

Fig. 7. Delay and area estimates for a high-radix CORDIC computation (n ¼ 32-bits).

. The main contribution to the total area for the . Using the identity XY ¼ 2Y log2 ðXÞ instead of XY ¼
proposed architecture comes from the logarithm stage eY lnðXÞ allows reducing the overall latency by one
due to the extra iteration (or two extra iterations for cycle regarding the algorithm proposed in [19], [23]
lower radices) necessary to obtain the final n-bit due to the elimination of the second LRCF multi-
precision.8 This extra iteration(s) result in bigger plication. Moreover, an extra multiplication YL ¼
shifters and look-up tables for the logarithm stage. Y log2 ðeÞ is also eliminated at the expense of
. The main contribution to the total area for the requiring two extra look-up tables for storing
proposed architecture comes from the combinational lj = ln 2 and ej = ln 2. Since the size of these extra
logic for small radices, but the area of the look-up tables grows exponentially with the radix r, the
tables becomes the primary factor for r ¼ 128 in impact in the total area is bigger for the higher radix
single-precision computations and for r ¼ 32 in values. However, the overall area requirements of
double-precision computations due to the exponen- our implementation are lower than or similar to
those of the algorithm proposed in [19], [23] for any
tial growth in the table size with the radix.
radix value r  128.
8. A precision of around nl ¼ n þ b or nl ¼ n þ 2b bits is typically . The cycle time increases with the radix, but the
necessary, as shown in Section 2.2. latency decreases. A good trade off between latency

Authorized licensed use limited to: Univ of Calif Los Angeles. Downloaded on December 8, 2008 at 00:51 from IEEE Xplore. Restrictions apply.
~ EIRO ET AL.: ALGORITHM AND ARCHITECTURE FOR LOGARITHM, EXPONENTIAL, AND POWERING COMPUTATION
PIN 1093

TABLE 4
Area Estimates for the Tables Used in the HR Vectoring CORDIC Implementation [1]

and cycle time can be achieved for specific values of adder, 2:1 multiplexer, and register.9 The execution time can
r, leading to low execution times. The area require- be therefore estimated as 115, similar to the execution time
ments must also be taken into account when trying of the powering algorithm (about 100). However, the total
to determine the most efficient implementations. area of the implementation proposed in [1] is about
According to Tables 2 and 3 and Fig. 6, the best trade
18; 000fa, more than 3 times the area estimate of our
offs correspond to the radix values highlighted in the
tables and the charts. Thus, in applications demand- powering algorithm. This is due to the huge size of the
ing high-speed processing, the most efficient im- tables employed (around 500Kbits), as shown in Table 4.
plementations correspond to radix r ¼ 128, while, The conclusions drawn from the analysis of the trade offs
for all other applications, implementations with between execution time and area costs of our architecture
radix r ¼ 32 may be preferable due to their lower for powering computation can be applied to high-radix
area requirements. CORDIC as well and, therefore, we have included, in Fig. 7,
The main conclusion that can be drawn from this the delay estimates for an implementation with radix
analysis is that little advantage, or no advantage at all, is r ¼ 128. The main paths in this implementation are more
obtained from using very-high radix values such as r ¼ 256, compensated and the cycle time is significantly reduced
r ¼ 512, or r ¼ 1; 024 because the execution times are from 11:5 to 10. However, the latency in this case is
similar to those achieved with r ¼ 128, but the area higher since 12 cycles are necessary for computing the
requirements are much higher. modulus (five cycles for microrotations, three cycles for the
two prescaling operations, and five cycles for scaling factor
4.1 Comparison with High-Radix CORDIC compensation, the first one overlapped with the last
For the sake of reference, we now give execution time and microrotation). The execution time can therefore be esti-
mated as 120, only 5 slower than the r ¼ 512 implemen-
hardware requirement estimates for a high-radix vectoring
tation, but the table size is now 142Kb, leading to a total
CORDIC implementation [1]. This algorithm has been area about 6; 000fa, similar to the area estimate of our
chosen as a reference because it allows the computation of algorithm (5; 964fa), and three times smaller than that of the
two-variable functions (for instance, the modulus of a implementation with radix r ¼ 512.
2D vector) and high-radix algorithms have been proposed Table 5 summarizes the comparison between the
for its acceleration. To allow for a fair comparison, the same proposed architecture for powering computation and the
assumptions are made for both algorithms, redundant CORDIC algorithm proposed in [1], for fixed-point n ¼
representation is used for the variables, and we consider a 32-bit computations, showing the latency, cycle time,
fixed-point implementation of our powering algorithm for execution time, and total area estimates in both cases. We
n ¼ 32 (details in [19]) since the algorithm proposed in [1] have included the implementations of high-radix CORDIC
performs n ¼ 32-bit computations. The main features of with both radix r ¼ 128 and r ¼ 512 to mark the importance
such an algorithm are the use of radix r ¼ 512, with a small of the conclusions drawn from the analysis performed.
radix R ¼ 32 for the first CORDIC microrotation, and two We can conclude that, with our algorithm, the powering
scaling operations to guarantee the convergence of the function can be computed efficiently in dedicated hardware
algorithm. with area cost and performance comparable to those of the
When computing the modulus, the r ¼ 512 CORDIC evaluation of other elementary functions.
algorithm requires four microrotations, two scaling opera-
tions (taking three cycles due to the complexity of the 5 COMPUTATION OF LOGARITHM AND EXPONENTIAL
second scaling), and four extra iterations for compensating
The logarithm and exponential operations are included in
the scaling factor. This results in a latency of 10 cycles the operation flow of our algorithm for powering computa-
(4 þ 3 þ 3 since the first extra iteration can be overlapped tion in such a way that it is not possible to directly obtain
with the last CORDIC microrotation). According to the them as an output result. However, some modifications can
approximate model used, the critical path has a delay of be made to the proposed architecture to allow the
11:5, as shown in Fig. 7. The critical path in this case
consists of the round&assim unit, module computation, table 9. In the estimates presented in [1], the registers and the look-up tables
are not included, but we have shown [19] that, in this type of architecture,
storing the logarithms of the elementary angles (required such tables determine the execution time since they belong to the slowest
for scale factor compensation), 5:1 multiplexer, 3:2 CSA paths in the architecture.

Authorized licensed use limited to: Univ of Calif Los Angeles. Downloaded on December 8, 2008 at 00:51 from IEEE Xplore. Restrictions apply.
1094 IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 9, SEPTEMBER 2004

TABLE 5
Comparison with High-Radix CORDIC Vectoring Algorithm (n ¼ 32 Bits)

Fig. 8. Block diagrams of the modified architecture. (a) Logarithm and LRCF multiplication. (b) Online exponential.

independent computation of the three operations: loga- logarithm (taking the output operand from the on-the-fly
rithm, exponential, and powering. The resulting unit is conversion unit at the output of the logarithm stage) and the
more versatile than the original one and becomes an computation of the exponential and powering (in both cases,
interesting alternative for elementary function computation
the output operand is the original on-the-fly conversion of the
in digital signal processing (DSP), computer graphics, and
other applications. output of the exponential stage). According to Fig. 8, the
Fig. 8 shows the main modifications to be performed to computation of the logarithm admits two combinations of the
the previously proposed architecture to allow for this extra control signals opP OW and opLE: “00” and “10,” while, for
functionality. These modifications only affect the inputs and the computation of the exponential, they must take the value
outputs of the main stages, whose internal composition “11” and, for powering computation, a control signal “01” is
remains unchanged. The extra multiplexers to be inserted necessary.
are highlighted within a dashed box in Fig. 8b. Also, in this The computation of logarithm and exponential has a
figure, app_reg stands for an append register, which is lower latency than the computation of powering and,
necessary in the computation of the powering since therefore, the control unit must also allow the completion
r2 F ½0 ¼ r2 ðF1 þ F2 Þ. of the operation in fewer cycles than the original case.
A summary of the control decisions to be taken depending Table 7 shows the latency values for the computation of the
on the operations to be computed is shown in Table 6. Two three operations when performing single and double-
control signals suffice for selecting the correct behavior of the precision computations. Only implementations with radix
architecture: opP OW and opLE. The signal opP OW is r ¼ 32 and r ¼ 128 for single-precision and r ¼ 32 and 128
responsible for selecting between including the exponential for double-precision, are considered since they have been
stage in the powering operation flow or using it as an shown in Section 4 to be the most efficient ones for
independent unit with parallel inputs and output. In order to powering computation.
do this, opP OW must properly set the inputs of the look-up The latency of the exponential computation is one cycle
tables T ABðe1 r1 Þ, T ABð log e1 Þ, the input parameter higher than that of the logarithm since, in this case, the first
r2 F ½0, and the digits Fjþ rþ1 . For logarithm computation, cycle is employed in addressing the initial look-up tables
the values of these signals are not relevant. The second control T ABðe1 r1 Þ, T ABð log e1 Þ, and the first iteration of the
signal, opLE, must distinguish between the computation of algorithm is completed in the second cycle.

Authorized licensed use limited to: Univ of Calif Los Angeles. Downloaded on December 8, 2008 at 00:51 from IEEE Xplore. Restrictions apply.
~ EIRO ET AL.: ALGORITHM AND ARCHITECTURE FOR LOGARITHM, EXPONENTIAL, AND POWERING COMPUTATION
PIN 1095

TABLE 6
Computations Performed in the Complete Unit Depending on the Operation

6 POWERING COMPUTATION WITH The new hardware requirements in the architecture to


EXPONENTS Y ¼ 1=q allow for the computation of the powering function with
exponents of the type Y ¼ 1=q are a look-up table where the
Our algorithm for powering computation can be extended
values 1=q are stored with high-accuracy (if only a subset of
for the computation of exponents of the type Y ¼ 1=q, with
q integer (the qth root extraction), as shown next: q values is interesting for the application, a significant area
reduction can be achieved since a small table can be used),
Ex Ex
X Y ¼ 21=q log2 ðMx 2 Þ
¼ 21=qðlog2 ðMx Þþlog2 ð2 ÞÞ the table to store the values 2 , and the unit to perform the
ð22Þ modulus operation. The term bEx =qc can be computed as
¼ 21=q½log2 ðMx ÞþEx  ¼ 21=q log2 ðMx Þ 21=qEx : Ex  1=q, taking the integer part of the result, once the table
The first part of the equation can be handled in the same look-up of 1=q has been performed.
way as in the original powering algorithm, taking the
integer and fractional parts I 0 and F 0 of the product 7 CONCLUSION
1=q log2 ðMx Þ. In this case, the integer part is always I 0 ¼ 0
An architecture for the computation of the logarithm,
since 1=q < 1 and log2 ðMx Þ < 1:
exponential, and powering functions has been presented
0
21=q log2 ðMx Þ ¼ 2F ; ð23Þ in this paper, based on a high-radix composite algorithm for
computing the powering X Y with integer exponent Y . The
which allows reducing the overall latency of the algorithm algorithm consists of computing XY as 2Y log2 ðMx Þ 2Y Ex ,
by one cycle.
through a sequence of overlapped operations: logarithm,
However, the term 21=qEx must be handled differently
multiplication, and exponential.
since the exponent is not an integer. In this case, an integer
The computation of the logarithm is carried out by a high-
division and modulus operation can be performed in order
radix digit-recurrence unit, with selection by rounding. The
to again extract the integer part of Ex =q:
intermediate multiplication is computed by high-radix left-
to-right carry-free (LRCF) multiplier. Finally, the exponential
21=qEx ¼ 2bEx =qc 2 ; ð24Þ
is computed by an online high-radix unit with selection by
with ¼ Ex %q. The output of the algorithm is therefore, rounding. The computation of the exponent of the result is
according to (22) and (24), the floating-point result: carried out in parallel by an integer multiply-add unit.
A sequential architecture implementing our algorithm
0
X Y ¼ Mz 2Ez ¼ 2 2F 2bEx =qc : ð25Þ has been proposed and estimates of the execution times and
hardware requirements have been obtained, based on an
F0
The term Mz ¼ 2 2 can be obtained as the output of the approximate model for the delay and the area of the main
exponential stage since the exponential algorithm computes components, for single and double-precision floating-point
e when the input argument is  and E½1 ¼  is taken as computations with radix values going from r ¼ 8 to
the initial value for the E½j recurrence [19], [24] instead of r ¼ 1; 024. The analysis of the trade offs between area and
E½1 ¼ 1. For any small q, a table can be employed to store speed in the implementation of this architecture shows that
the most efficient implementation corresponds to radix
the q values 2 , which go from 20 ; 21=q ; . . . to 2q1=q and then
r ¼ 32, although an implementation with radix r ¼ 128 may
the appropriate value can be used to load and initialize the be suitable when very high-speed processing is necessary,
register E½j as 2 , depending on the result of the modulus at the expense of greater hardware requirements. There is
operation . The exponent term Ez is 2bEx =qc , with an integer no advantage in the use of a very-high radix value such as
argument in the exponential. r ¼ 256, r ¼ 512, or r ¼ 1; 024 since the execution times are
similar to those achieved with r ¼ 128, but the area
requirements are much higher.
TABLE 7 A comparison with a high-radix CORDIC architecture
Latency for the Computation of the Three Operations shows that the execution times and hardware requirements
of the unit computing the powering are similar to those of
other iterative algorithms for the computation of elementary
functions.
Finally, the proposed architecture has been modified to
allow the independent computation of logarithm and
exponential, with lower latencies than those of powering
and our algorithm has been extended to powering

Authorized licensed use limited to: Univ of Calif Los Angeles. Downloaded on December 8, 2008 at 00:51 from IEEE Xplore. Restrictions apply.
1096 IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 9, SEPTEMBER 2004

computation with exponents of the type Y ¼ 1=q (i.e., the [20] J.-A. Piñeiro and J.D. Bruguera, “High-Speed Double-Precision
Computation of Reciprocal, Division, Square Root and Inverse
qth root extraction). Square Root,” IEEE Trans. Computers, vol. 51, no. 12, pp. 1377-1388,
Dec. 2002.
[21] J.-A. Piñeiro, J.D. Bruguera, and J.-M. Muller, “Faithful Powering
ACKNOWLEDGMENTS Computation Using Table Look-Up and Fused Accumulation
The authors would like to thank the anonymous reviewers Tree,” Proc. IEEE 15th Int’l Symp. Computer Arithmetic, pp. 40-47,
2001.
for their suggestions and contributions, which have led to [22] J.-A. Piñeiro, M.D. Ercegovac, and J.D. Bruguera, “High-Radix
some optimizations and additional results. This work was Logarithm with Selection by Rounding,” Proc. IEEE 13th Int’l Conf.
Application-Specific Systems, Architectures and Processors (ASAP ’02),
developed while J.-A. Piñeiro was with the University of pp. 101-110, July 2002.
California, Los Angeles (UCLA). J.-A. Piñeiro and [23] J.-A. Piñeiro, M.D. Ercegovac, and J.D. Bruguera, “High-Radix
J.D. Bruguera were partially supported by the Spanish Iterative Algorithm for Powering Computation,” Proc. IEEE 16th
Int’l Symp. Computer Arithmetic (ARITH16), pp. 204-211, June 2003.
Ministry of Science and Technology (MCyT-FEDER) under [24] J.-A. Piñeiro, M.D. Ercegovac, and J.D. Bruguera, “On-Line High-
contract TIC2001-3694-C02. Radix Exponential with Selection by Rounding,” Proc. IEEE Int’l
Symp. Circuits And Systems (ISCAS 2003), May 2003.
[25] M.J. Schulte and J.E. Stine, “Approximating Elementary Functions
with Symmetric Bipartite Tables,” IEEE Trans. Computers, vol. 48,
REFERENCES no. 8, pp. 842-847, Aug. 1999.
[1] E. Antelo, T. Lang, and J.D. Bruguera, “Very-High Radix CORDIC [26] H.C. Shin, J.A. Lee, and L.S. Kim, “A Minimized Hardware
Vectoring with Scalings andSelection by Rounding,” Proc. 14th Architecture of Fast Phong Shader Using Taylor Series Approx-
Int’l Symp. Computer Arithmetic (ARITH14), pp. 204-213, Apr. 1999. imation in 3D Graphics,” Proc. Int’l Conf. Computer Design, VLSI in
[2] J.-C. Bajard, S. Kla, and J.-M. Muller, “BKM: A New Hardware Computers and Processors, pp. 286-291, 1998.
Algorithm for Complex Elementary Functions,” IEEE Trans. [27] N. Takagi, “Powering by a Table Look-Up and a Multiplication
Computers, vol. 43, no. 8, pp. 955-964, Aug. 1994. with Operand Modification,” IEEE Trans. Computers, vol. 47,
[3] W. Cody and W. Waite, Software Manual for the Elementary no. 11, pp. 1216-1222, Nov. 1998.
Functions. Prentice Hall, 1980. [28] P.T.P. Tang, “Table Look-Up Algorithms for Elementary Func-
[4] M.D. Ercegovac and T. Lang, “On-the-Fly Conversion of Re- tions and Their Error Analysis,” Proc. 10th IEEE Symp. Computer
dundant into Conventiona lRepresentations,” IEEE Trans. Compu- Arithmetic, pp. 232-236, June 1991.
ters, vol. 36, no. 7, pp. 895-897, July 1987. [29] M. Zhang, S. Vassiliadis, and J.G. Delgado-Frias, “Sigmoid
[5] M.D. Ercegovac and T. Lang, “Fast Multiplication without Carry Generators for Neural Computing Using Piecewise Approxima-
Propagate Addition,” IEEE Trans. Computers, vol. 39, no. 11, tions,” IEEE Trans. Computers, vol. 45, no. 9, pp. 1045-1050, Sept.
pp. 1385-1390, Nov. 1990. 1996.
[6] M.D. Ercegovac and T. Lang, “On-the-Fly Rounding,” IEEE Trans.
Computers, vol. 41, no. 12, pp. 1497-1503, Dec. 1992. Jose-Alejandro Piñeiro received the PhD
[7] M.D. Ercegovac and T. Lang, Division and Square Root: Digit degree in computer engineering in 2003 and
Recurrence Algorithms and Implementations. Kluwer Academic, the MSc degree (1999) and BSc degree (1998)
1994. in physics (electronics), from the University of
[8] M.D. Ercegovac, T. Lang, J.-M. Muller, and A. Tisserand, Santiago de Compostela, Spain. In February
“Reciprocation, Square Root, Inverse Square Root, and Some 2004, he joined Intel Barcelona Research
Elementary Functions Using Small Multipliers,” IEEE Trans. Center, Intel Labs-UPC, whose research fo-
Computers, vol. 49, no. 7, pp. 628-637, July 2000. cuses on new microarchitectural paradigms and
[9] C.T. Fike, Computer Evaluation of Mathematical Functions. Prentice code generation techniques for both IA-32 and
Hall, 1968. IPF families. His research interests are also in
[10] S. Gal and B. Bachelis, “An Accurate Elementary Mathematical the area of computer arithmetic, VLSI design, and numerical processors.
Library for the IEEE Floating Point Standard,” ACM Trans. Math.
Software, vol. 17, no. 1, pp. 26-45, Mar. 1991. Milos D. Ercegovac received the MS (1972) and
[11] R.E. Goldschmidt, “Applications of Division by Convergence,” PhD (1975) degrees in computer science from
MS thesis, Electrical Eng. Dept., Massachussets Inst. of Technol- the University of Illinois, Urbana-Champaign, and
ogy (MIT), June 1964. the BS degree in electrical engineering (1965)
[12] D. Harris, “A Powering Unit for an OpenGL Lighting Engine,” from the University of Belgrade, Yugoslavia. He
Proc. 35th Asilomar Conf. Signals, Systems, and Computers, pp. 1641- is a professor and chair of the University of
1645, 2001. California at Los Angeles Computer Science
[13] J.F. Hart, E.W. Cheney, C.L. Lawson, H.J. Maehly, C.K. Mesztenyi, Department. He specializes in research and
J.R. Rice, H.G. Thacher, and C. Witzgall, Computer Approximations. teaching in digital arithmetic, digital design, and
New York: Wiley, 1968. computer system architecture. His research
[14] R.K. Kolagotla, H.R. Srinivas, and G.F. Burns, “VLSI Implementa- contributions have been extensively published in journals and conference
tion of a 200-MHz 16*16 Left-to-Right Carry-Free Multiplier in proceedings. He is a coauthor of two textbooks on digital design and of a
0.35 m CMOS Technology for Next-Generation DSPs,” Proc. IEEE monograph in the area of digital arithmetic. Dr. Ercegovac has been
1997 Custom Integrated Circuits Conf., pp. 469-472, 1997. involved in organizing the IEEE Symposia on Computer Arithmetic since
[15] P. Kornerup, “Reviewing 4-to-2 Adders for Multi-Operand 1978. He served as an editor of the IEEE Transactions on Computers and
Addition,” Proc. IEEE 13th Int’l Conf. Application-Specific Systems, as a subject area editor for the Journal of Parallel and Distributed
Architectures and Processors (ASAP ’02), pp. 218-229, 2002. Computing. He is a fellow of the IEEE and a member of the ACM.
[16] D.M. Lewis, “114 MFLOPS Logarithmic Number System Arith-
metic Unit for DSP Applications,” IEEE J. Solid-State Circuits, Javier D. Bruguera received the BSc degree in
vol. 30, no. 12, pp. 1547-1553, 1995, physics and the PhD degree from the University
[17] J.-M. Muller, Elementary Functions. Algorithms and Implementation. of Santiago de Compostela, Spain, in 1984 and
Birkhauser, 1997. 1989, respectively. Currently, he is a professor
[18] S.F. Oberman, “Floating Point Division and Square Root Algo- with the Department of Electronic and Computer
rithms and Implementation in the AMD-K7 Microprocessor,” Engineering at the University of Santiago de
Proc. 14th Symp. Computer Arithmetic (ARITH14), pp. 106-115, Apr. Compostela. His research interests are in the
1999. area of computer arithmetic, VLSI design for
[19] J.-A. Piñeiro, “Algorithms and Architectures for Elementary signal and image processing, and parallel
Function Computation,” PhD dissertation, Dept. of Electronics architectures. He is a member of the IEEE.
and Computer Eng., Univ. Santiago de Compostela, June 2003,
http://www.ac.usc.es/~alex.

Authorized licensed use limited to: Univ of Calif Los Angeles. Downloaded on December 8, 2008 at 00:51 from IEEE Xplore. Restrictions apply.

You might also like