Integration, The VLSI Journal: Ankur Changela, Mazad Zaveri, Deepak Verma

Integration, the VLSI Journal 73 (2020) 89–100
Contents lists available at ScienceDirect
Integration, the VLSI Journal

journal homepage: www.elsevier.com/locate/vlsi
FPGA implementation of high-performance, resource-efficient Radix-16

CORDIC rotator based FFT algorithm
Ankur Changela a, ∗ , Mazad Zaveri b , Deepak Verma c
a
Institute of Technology and Engineering, Indus University, Ahmedabad, Gujarat, India
b
Pandit Deendayal Petroleum University, Department of ICT, Gandhinagar, Gujarat, India
c School of Engineering and Applied Science, Ahmedabad University, Ahmedabad, Gujarat, India
A R T I C L E I N F O A B S T R A C T
Keywords: The fast Fourier transform (FFT) is an algorithm widely used to compute the discrete Fourier transform (DFT) in
CORDIC algorithm real-time digital signal processing. High-performance with fewer resources is highly desirable for any real-time
FFT
application. Our proposed work presents the implementation of the radix-2 decimation-in-frequency (R2DIF)
Double-path delay commutator (DDC)
FFT algorithm based on the modified feed-forward double-path delay commutator (DDC) architecture on FPGA
Canonical signed digit (CSD)
Redundant arithmetic
device. Need for a complex multiplier to carry out the multiplication of complex twiddle factors and large mem-
FPGA ory to store the twiddle factors are the main concerns for FFT implementation. Propose work aims to address
these issues. In this work, a high-performance radix-16 COordinate Rotational DIgital Computer (CORDIC) algo-
rithm based rotator is proposed to carry out the complex twiddle factor multiplication. Further, CORDIC needs
only rotational angles to carry out complex multiplication, which reduces the need for large memory to store
the twiddle factors. To compute the total rotation for n-bit precision, our proposed radix-16 CORDIC algorithm
takes n∕4 iteration as compared to n iteration of the radix-2 CORDIC algorithm. Our proposed architecture
of the radix-2 decimation-in-frequency (R2DIF) algorithm is implemented on a Virtex−7 series FPGA. Further,
the detailed comparison is presented between our proposed FFT implementation and other recently proposed
FFT implementations. Experimental results suggest that proposed implementation has less latency and hardware
utilization as compared to recently proposed implementations.
1. Introduction of the FFT algorithm is very low (O(Nlog(N))), it is suitable for the VLSI
implementation.
The discrete Fourier transform (DFT) performs a crucial role in the Modern real-time applications demand high-throughput, low-
field of digital signal processing [1–3]. DFT is used to design, analyze, latency, and less hardware utilization for efficient FFT implementation
and implement many digital signal processing (DSP) algorithms and [12]. To meet these demands, pipelined architectures [1,6,7,13–15] are
systems [2,4,5]. DFT has very high computation complexity (O(N2 )) widely used by hardware designers. The pipelined architecture allows
[6,7] and its VLSI implementation is not practical. FFT is an algorithm all stages to compute in parallel, which results in high-throughput and
to compute the N-point DFT efficiently. FFT algorithm given by Cooly- low-latency. Feed-forward (FF) [4,7,11,12,16,17] and feed-back (FB)
Tukey is mostly used to compute the N-point DFT when N is the integer [4,6,18,19] are the two main types of pipeline architectures. In feed-
power of two [8]. DIT (decimation-in-time) and DIF (decimation-in- forward architecture, data flows from the current stage to the next stage
frequency) are two most widely used FFT algorithms to compute the continuously, whereas few outputs of the butterfly stage are fed back
DFT of an N-point input sequence. The FFT algorithm decreases the to the memories at the same stage in feed-back architecture. Memory-
computational complexity of DFT to O(Nlog(N)) by decomposing the based iterative architectures have low resource requirements as com-
computation of DFT of a sequence of length N into successively smaller pared to pipelined architectures, but they have a problem of large
DFT and taking advantage of the periodicity and symmetry properties of latency [9,12].
the complex twiddle factor [9–11]. Since the computational complexity In this work, high-performance, and resource-efficient, FPGA imple-
mentation of the radix-2 DIF FFT algorithm is proposed. To improve the
∗ Corresponding author.
E-mail addresses: ankurchangela.ec@indusuni.ac.in (A. Changela), zaverimazad@gmail.com (M. Zaveri), deepak.verma@ahduni.edu.in (D. Verma).
https://doi.org/10.1016/j.vlsi.2020.03.008
Received 17 November 2019; Received in revised form 20 February 2020; Accepted 31 March 2020
Available online 16 April 2020
0167-9260/© 2020 Elsevier B.V. All rights reserved.
A. Changela et al. Integration, the VLSI Journal 73 (2020) 89–100
performance, our proposed radix-16 CORDIC algorithm based rotator sented in section 4. Section 5 concludes the proposed work.
and modified parallel pipelined double-path delay commutator archi-
tecture are used. The main component of the FPGA implementation of 2. The proposed radix-16 CORDIC algorithm
the FFT algorithm is a butterfly engine that uses two complex adders
and one complex multiplier. Further, four real adders and three real In a CORDIC algorithm, a rotating vector is rotated in a two-
multipliers are required to implement the complex multiplier, and the dimensional plane to achieve the total rotation for the specified angle.
complex adder can be implemented using two real adders. Proposed The given angle is divided into a series of random angles. However,
FFT implementation aims to reduce the arithmetic resources by using these arbitrary angles are chosen in such a way that the rotation of a
our proposed low-latency, high-performance radix-16 CORDIC algo- vector can be calculated by simple binary add and shift operations. In
rithm based rotator. CORDIC is a rotational algorithm that unites the the CORDIC algorithm, vector rotation is performed along with three
theory of rotational geometry in a two-dimensional plane along with coordinate systems: circular, hyperbolic, and linear. CORDIC processor
three coordinate systems [20–22]. The proposed radix-16 CORDIC rota- can be configured to evaluate the trigonometric and hyperbolic func-
tor is configured in circular rotation mode to multiply the complex tions by selecting a suitable mode of operation and coordinate system
twiddles with samples, which removes the need for the complex mul- [22,27].
tiplier. The radix-2 CORDIC algorithm has a drawback of high latency Radix of the CORDIC algorithm indicates the number of bits it can
as it takes 5n
4
iterations for n-bit precision to compute the total rota- process in one iteration. Radix-2n CORDIC algorithm can process n-bit
tion with scale-factor compensation [20]. The radix-4 CORDIC algo- in a single iteration. To speed up the computation, higher radix CORDIC
rithm has decreased the iterations to 2n to calculate the complete rota- may be employed to process more data bits in a single iteration, which
tion without scale factor compensation [23]. The double-step branch- also results in a reduction of latency. Radix-16 CORDIC algorithm can
ing CORDIC algorithm presented in Ref. [24] has decreased the total process 4-bits of data in one iteration. For rotation mode, we propose
iterations to n+2 3 to calculate the complete rotation. Hybrid CORDIC the following radix-16 CORDIC algorithm iteration.
algorithm reported that in Ref. [25] needs 3n + 1 iterations to mea-
8 xi+1 = xi − 𝜎i 16−i yi
sure the total rotation with scale factor compensation, but hardware
complexity is very high. Recently reported radix-8 CORDIC algorithm yi+1 = yi + 𝜎i 16−i xi
requires 3n iterations to evaluate the complete rotation and presented
the trade-off between hardware complexity and iterations required to zi+1 = zi − tan−1 (𝜎i 16−i ) (1)
achieve a certain precision [21]. Our proposed radix-16 CORDIC algo- ∏√
rithm is implemented with simple hardware as its VLSI implementation K= 1 + 𝜎i2 16−2i
needs only adders and shifters. Further, the radix-16 CORDIC algorithm i≥0
needs 4n iterations to find out the total rotation with two supplementary In radix-16 CORDIC algorithm, the rotation angle is divided into a series
iterations to compensate the scale factor for n-bit accuracy as compared of basic angles, i.e. tan−1 (𝜎 i 16−i ) where 𝜎 i represents the selection func-
to n iterations of the radix-2 CORDIC algorithm. The selection function tion. The selection function of radix-16 CORDIC algorithm can have any
of the radix-16 CORDIC algorithm is complex as it is not an integer value from 17 values i.e. 𝜎 i ∈ {−8, −7, … , 0, … , 7, 8} as compared to
power of two. This problem can be solved by adding the multiplexer in simple selection function of radix-2 CORDIC algorithm i.e. 𝜎 i ∈ {−1, 1}.
the data path. Radix-16 CORDIC rotator needs only twiddle angles to The problem of complex selection function can be solved by adding
multiply the twiddle factors with samples, which removes the need for a 4-to-1 multiplexer in the data path and using redundant arithmetic
large memory storage. The canonical signed digit (CSD) representation (carry-save). Detail of this implementation is discussed in the section
is used to represent the precomputed twiddle factors, which reduce the on architecture. From (1), it can be seen that the scale factor is not
partial product and hardware needed to add the partial products. a constant, and its value ranges from 1 to 8.81 and depends on two
Many researchers have tried to reduce the hardware by either reduc- parameters, iteration being executed and selection function.
ing the arithmetic resources or by increasing the utilization of arith-
metic resources. M. Ayinala et al. have reduced the area and power
2.1. A proof of convergence in rotation mode
by using register minimization technique and folding transformation
[14]. J. Wang et al. have presented a mixed-decimation multi-path
For the particular value of selection function 𝜎 i , we have to show
delay feed-back approach for the radix-2k FFT in Ref. [13]. They have
that variable z is limited in each iteration to prove the convergence
integrated the DIT operations into DIF operated computing units. Using
of the proposed CORDIC algorithm. To verify the convergence, the ini-
the concept of folding transformation, they have increased the arith-
tial value of the input angle z0 is assumed between − 𝜋∕2 to 𝜋∕2 i.e.
metic resources utilization and reduced the computing delay at the cost
z0 ∈ [−𝜋∕2, 𝜋∕2]. The theory of SRT division showed in Refs. [28,29]
of complex control logic. M. Sanchez has presented a detailed study
is applied to conclude the criteria for the selection function 𝜎 i . Accord-
of the FFT architectures and implemented the simplified monobit FFT
ing to a method presented in Refs. [28,29], we define a variable wi as
algorithm to maximize the throughput and to minimize the area [4].
follows:
Z. Wang et al. have implemented the combined single delay commuta-
tor (SDC), and single delay feed-back (SDF) pipelined FFT processor to wi = 16i zi
reduce the complex arithmetic resources but had a disadvantage of low-
To transform the variable zi+1 given in (1) into variable wi+1 , zi+1 is
throughput and high-latency [15]. Many researchers have presented a
multiplied with 16i+1 and it is rearranged as follows:
higher radix FFT algorithm. However, the VLSI implementation of these
algorithms needs complex hardware [6,7,26]. 16i+1 zi+1 = 16i+1 zi − 16i+1 tan−1 (𝜎i 16−i )
The rest of the paper has been organized as follows: Section 2 is
on the proposed high-performance, low-latency radix-16 CORDIC algo- wi+1 = 16(wi − Pi [𝜎i ]) (2)
rithm. The proof of convergence, scale factor compensation, calcula-
Pi [𝜎i ] = 16i tan−1 (𝜎i 16−i )
tions of selection function, and fully pipelined unrolled architecture of
the radix-16 CORDIC algorithm are also presented in section 2. The The recursive equation of wi , given in (2), is similar to the recursive
details of the FPGA implementation of the proposed radix-2 DIF FFT equation of the partial remainder of the high radix SRT division. For
algorithm is presented in section 3. Simulation results and compari- convergence of the radix-16 CORDIC algorithm, it is necessary that after
son of proposed implementation with existing implementations are pre- each iteration, the variable wi+1 is bounded in the interval. The method
for the radix-8 SRT division presented in Ref. [21] is used to derive the
90
method for the radix-16 SRT division. Two bound values, lower bound Table 1
M and upper bound N, are defined as follows: Decomposition of complex selection
function.
Mi [q] = Pi [q] − (8∕15)Pi [1]
Selection Generated partial products
Ni [q] = Pi [q] + (8∕15)Pi [1] (3) Function Multiplexer-1 Multiplexer-2
where q ∈ {±8, ±7, ±6, ±5, ±4, ±3, ±2, ±1, 0}. In each iteration, there ±8 ±8xi 0
±7 ±8xi ∓xi
must be a selection function, 𝜎 i = q, for which wi is limited in an
±6 ±4xi ±2xi
interval as follows: ±5 ±4xi ±xi
±4 ±4xi 0
Mi [q] ≤ wi ≤ Ni [q] (4) ±3 ±2xi ±xi
±2 ±2xi 0
Interval, given in (4), must be valid for at least one value of 𝜎 i = q. ±1 ±xi 0
For the algorithm to be converged, continuity of interval given in (4) 0 0 0
has to be verified. This verification is carried out for all iterations using
Matlab. To prove the convergence of the algorithm, 𝜎 i has to be selected
such a way that wi is bounded in interval given by
The selection function for i > 1 can be made independent of iter-
|wi | ≤ Pi [8] + (8∕15)Pi [1] (5) ation being executed. To achieve this condition, the maximum value
of the lower limit (M) and the minimum value of the upper limit (N)
The method of induction is used to prove the convergence of the radix-
are chosen as this makes overlapping between intervals independent
16 CORDIC algorithm. The proof is carried out in two parts. In the first
of the iteration index. Since limits M and N are monotonically increas-
part, the algorithm is verified for i = 0, and later algorithm is verified
ing function, the maximum value of the lower limit can be obtained
for i ≥ 1. Assuming the initial value of the input angle |w0 | ≤ 𝜋∕2,
when i = ∞ i.e., M∞ [q] and minimum value of the upper limit can be
from (5),
obtained when i = 2 i.e., N2 [q].
|w0 | ≤ P0 [8] + (8∕15)P0 [1] ≈ 1.85 The overlapping area between the intervals is shown in Fig. 1. The
criteria for selecting the selection function is derived separately for
Hence (5) is verified for i = 0. To prove the convergence for i ≥ 1, i = 0, i = 1, and i > 1 as given in Table 2. However, the case when
assuming that (5) is verified for i = n − 1 then (5) has to be proved i > 1 is discussed here. Selection criteria for i = 0 and i = 1 can be
valid for i = n. The criteria given in (4) can be rewritten for i = n − 1 obtained in a similar way. To find out the selection criteria for 𝜎 i = 1,
as below, first intervals for selecting 𝜎 i = 1 and 𝜎 i = 2 have to be found out. As
shown in Fig. 1, the interval for selecting 𝜎 i = 1 is 0.46 ≤ wi < 1.53
Mn−1 [q] ≤ wn−1 ≤ Nn−1 [q]
and interval for selecting 𝜎 i = 2 is 1.46 ≤ wi < 2.53. Hence for select-
To obtain the inequality given in (6), first M and N are replaced with ing 𝜎 i = 1, the value of wi has to be chosen from the overlapping
their values given in (3). Later expression is subtracted by Pi [q] and between these two intervals i.e. 1.46 ≤ wi < 1.53. Similarly, over-
multiplied by 16. lapping between two intervals for selecting 𝜎 i = 1 and 𝜎 i = 0 is
0.46 ≤ wi < 0.53. Using these obtained results, the range of wi for
128 128
− P [1] ≤ 16(wn−1 − Pn−1 [q]) ≤ P [1] (6) selecting 𝜎 i = 1 can be set i.e. 𝜎 i = 1 for 0.5 ≤ wi < 1.5.
15 n−1 15 n−1
As the selection criteria decides the size of the comparator, all values
Using the definition of variable wi given in (2), we obtain are chosen in such a way that a minimum number of bits are required
128 to represent all values of selection criteria in binary. All the values of
|wn | ≤ P [1] (7) selection criteria for 𝜎 0 can be expressed in binary using 5-bit only, and
15 n−1
128
hence 5-bit comparator is required for its implementation. The detail
For all i ≥ 1, it is verified that inequality P [1]
15 n−1
≤ Pn [8] + of this implementation is discussed in the section on the architecture of
(8∕15)Pn [1] is valid. the proposed radix-16 CORDIC algorithm.
|wn | ≤ Pn [8] + (8∕15)Pn [1] (8)
2.3. Scale factor compensation
Hence theorem is also verified for i ≥ 1.
For radix higher than two, the scale factor is variable, and it depends
2.2. Selection function upon the value of 𝜎 i through the input angle z0 and iteration being
executed. Scale factor compensation can be a considerable concern for
Multiplication of 𝜎 i with xi cannot be achieved by shift operation the designers as it is not constant. After computing the total rotation,
whenever the selection function is not a two to the power of n, where the CORDIC algorithm scales up the final value of coordinates of the
n is an integer. To handle such a situation, two partial products are rotated vector by scale factor Ki𝜎i given by
generated.
One additional 4-to-1 multiplexer is added in data-path to gener- Ki−𝜎1 = (1 + 𝜎i2 16−2i )−1∕2 = (1 + 𝜎i2 2−8i )−1∕2 (9)
i
ate the second partial product. The detail on partial products is given
To find out the true values of vector, the rotated vector has to be mul-
in Table 1. Since both the multiplexers compute the partial products n
tiplied by Ki−𝜎1 . The scale factor is determined for 16 + 1 iteration and
simultaneously, the added multiplexer does not contribute to the crit- i
n
ical delay of data-path. Later generated partial products and rotating for iteration i > 16
+ 1, the scale factor is assumed to be one, as in each
vector are added using 3-to-2 CSA (carry-save arithmetic). As shown in iteration the value of 𝜎i2 is shifted right by 8-bit positions. In this imple-
Table 1, 5-to-1 and 4-to-1 multiplexers are required to generate first and mentation, 16-bit fixed-point precision is taken to represent the value
second partial products, respectively. The detail on the implementation of the scale factor. For all possible values of 𝜎 i , the scaled factor is pre-
of this scheme is given in the section on architecture. calculated and stored in the look-up table. Later, the rotated vector is
The criteria for the selection function is given in (4). To obtain multiplied by the stored scale factor using shift and add operations. The
the limits of selection intervals, the overlapping area between intervals scale factor is broken into 4n number of nibbles, and hence for 16-bit
given in (4) for all the values of selection function must be obtained. precision, four nibbles have to be considered. To multiply the rotated
91
Fig. 1. Overlapping area between the intervals.
Table 2
Selection criteria for i = 0, i = 1, and i > 1.
Selection function i = 0 i = 1 i > 1
8 w0 ≥ 1.5 w1 ≥ 7 wi ≥ 7.5
7 1.5 > w0 ≥ 1.375 7 > w1 ≥ 6.125 7.5 > wi ≥ 6.5
6 1.375 > w0 ≥ 1.25 6.125 > w1 ≥ 5.25 6.5 > wi ≥ 5.5
5 1.25 > w0 ≥ 1.125 5.25 > w1 ≥ 4.375 5.5 > wi ≥ 4.5
4 1.125 > w0 ≥ 1 4.375 > w1 ≥ 3.5 4.5 > wi ≥ 3.5
3 1 > w0 ≥ 0.875 3.5 > w1 ≥ 2.5 3.5 > wi ≥ 2.5
2 0.875 > w0 ≥ 0.75 2.5 > w1 ≥ 1.5 2.5 > wi ≥ 1.5
1 0.75 > w0 ≥ 0.375 1.5 > w1 ≥ 0.5 1.5 > wi ≥ 0.5
0 0.375 > w0 ≥ −0.375 0.5 > w1 ≥ −0.5 0.5 > wi ≥ −0.5
−1 −0.375 > w0 ≥ −0.75 −0.5 > w1 ≥ −1.5 −0.5 > wi ≥ −1.5
−2 −0.75 > w0 ≥ −0.875 −1.5 > w1 ≥ −2.5 −1.5 > wi ≥ −2.5
−3 −0.875 > w0 ≥ −1 −2.5 > w1 ≥ −3.5 −2.5 > wi ≥ −3.5
−4 −1 > w0 ≥ −1.125 −3.5 > w1 ≥ −4.375 −3.5 > wi ≥ −4.5
−5 −1.125 > w0 ≥ −1.25 −4.375 > w1 ≥ −5.25 −4.5 > wi ≥ −5.5
−6 −1.25 > w0 ≥ −1.375 −5.25 > w1 ≥ −6.125 −5.5 > wi ≥ −6.5
−7 −1.375 > w0 ≥ −1.5 −6.125 > w1 ≥ −7 −6.5 > wi ≥ −7.5
−8 −1.5 > w0 −7 > w1 −7.5 > wi
vector with a nibble, four partial products have to be generated. To Table 3

restrict the partial products to two, and to reduce the hardware required CSD coding of unsigned nibbles.
to implement this scheme, the concept of canonical signed digit (CSD)
Nibble in binary CSD Coding Stage-1 Stage-2 Carry out c4 c3 c2 c1 c0
representation presented in Refs. [29,30] is used. Each nibble of the
0000 0000 0 0 0 00000
scale factor is converted into CSD format, and then it is stored on a
0001 0001 0 1 0 00001
look-up table. Note that, in CSD representation one digit can have any 0010 0010 0 2 0 00010
value from the value set {1, 0, 1} where 1 represents the −1. All the 0011 0101 4 −1 0 01101
possible values of nibbles and their CSD format are listed in Table 3. 0100 0100 4 0 0 01000
Stage-1 and stage-2 indicate the multiplicand to generate partial prod- 0101 0101 4 1 0 01001
0110 0110 4 2 0 01010
ucts. All the values of stage-1 and stage-2 are an integer power of two,
0111 1001 8 −1 0 10101
and it can be multiplied with a rotated vector by the simple shift oper- 1000 1000 8 0 0 10000
ation. Stage-1 has four different values 0, 4, and ±8, and stage-2 has 1001 1001 8 1 0 10001
five different values 0, ±1 and ± 2. Two binary bits, c4 c3 are used to 1010 1010 8 2 0 10010
code the four different values of stage-1 and three binary bits, c2 c1 c0 1011 0101 −4 −1 1 11101
1100 0100 −4 0 1 11000
are used to code the five different values of stage-2. Hence, 4-to-1 and
1101 0101 −4 1 1 11001
5-to-1 multiplexers are needed to generate two partial products. Also, 1110 0010 0 −2 1 00110
these two multiplexers compute simultaneously, and only the delay of 1111 0001 0 −1 1 00101
the 5-to-1 multiplexer contributes to the critical data path. The detail
of these implementations is given in the section on architecture. Since
5-bit is needed to represent the nibble in CSD format, CSD representa-
tion of 16-bit scale factor requires 20-bit. Since there are nine different In fully pipelined architecture, all stages compute the rotation paral-
absolute values of the scale factor, the size of the look-up table is 81 by lelly as there is dedicated hardware for each stage, which leads to high
20-bit. throughput. Two partial products are generated to handle the complex
selection function, which are added to a rotating vector in each iter-
ation. To perform this addition, redundant (carry-save) arithmetic is
2.4. The architecture of proposed Radix-16 CORDIC algorithm used. Architecture for iteration index i = 0, i = 1, and i > 1 will be
similar except the comparator (because the selection criteria for itera-
To increase the throughput, we propose fully pipelined architecture tion index i = 0, i = 1, and i > 1 are different). However, architec-
for FPGA implementation of our proposed radix-16 CORDIC algorithm. ture for iteration index i = 0 is discussed here.
92
Fig. 3. Architecture of shifter.
below,
TCP = TComp. + TMux + TFA + TCPA (10)
Fig. 4 shows the architecture of the two-stage compensation scheme.

In the first stage, 20-bits (n3 n2 n1 n0 ) CSD coded scale factor is accessed
from scale factor ROM based on the value of the selection function
of the first and second iteration. The four shifters generate the eight
partial products based on the value of n3 , n2 , n1 , and n0 . The architec-
ture of shifter is the same as the architecture shown in Fig. 3 except
the multiplicand weights. Fig. 5 shows the 8-to-2 CSA scheme used
to add all eight partial products. The numbers 1, 0, … , −12, −13, −14
represent the bit position of the particular partial product. In the first
Fig. 2. Fully pipelined architecture of the X, Y and W rotators of radix-16 addition, three partial products X1 , X2 , and X3 are added to generate
CORDIC processor. sum and carry vector S1 and C1 . Similarly, the second addition is per-
formed to generate S2 and C2 . The solid horizontal line indicates the
break between the two levels of the 8-to-2 CSA. The vertical box cov-
ering three-bit positions is a full adder whereas, and covering two-bit
positions is a half adder. The final sum and carry vector, S6 and C6 ,
The content of the ROM table for each iteration is different, as it are forwarded to the pipeline register for further processing. The total
stores the value of tan−1 (𝜎 i 16−i ). Fig. 2 shows the architecture of iter- delay of the first stage of the compensation scheme can be expressed as,
ation when i = 0. X, Y and W rotators are implemented separately as
given in (1) and (2). For i = 0, the value of the variable w0 is the same TDCP = TMUX + 4 × TFA (11)
as input angle z0 . The value of the variable w0 has to be compared with The sum and carry vector, S6
and C6 ,
generated in the first stage, are
the selection criteria given in Table 2. The value of w0 is represented added using carry propagation adder (CPA) in the second stage, which
in fixed-point representation. The selection criteria given in Table 2 is has a delay of one CPA. The value of K1 is always between 1 and 8.811
.
chosen in such a way that 5 binary bits are required for comparison. The i
1
comparator compares the value of w0 with selection criteria, and it gen- The multiplication of Xi+1 and Yi+1 with the Ki
does not result in over-
erates the selection function as an output. Once the selection function flow, and final value of the multiplication can be truncated using 16-bit
is known, the corresponding value of tan−1 (𝜎 i 16−i ) is accessed from the without any error.
angle table (ROM). Later the values of tan−1 (𝜎 i 16−i ) and w0 are added For 16-bit precision, a total of four iterations are required to com-
using carry propagation adder (CPA). The result of this addition is mul- pute total rotation, and two additional iterations are required for com-
tiplied by 16 using a fixed shift and forwarded to the next stage for
further processing. The architecture of X and Y rotators are similar, and
only X rotator is discussed here. X rotator has two data paths. In the
first data path, the variable x0 moves from pipelined register to 3-to-2
CSA directly. In the second data path, the x0 moves through the shifter
to generate two partial products, x0p1 and x0p2 , to handle the complex
selection function. For example, if selection function is 7 then shifter
generates two partial products x0p1 = 8x0 and x0p2 = −x0 . These two
partial products are used by Y-rotator to compute y1 . In X rotator, 3-to-2
CSA generates two vectors, x1s and x1c by adding two partial products,
y0p1 and y0p2 generated by Y rotator and x0 . Later, these two vectors
are added by carry propagation adder (CPA) to compute the value of
x1 . The value of x1 is stored on a pipelined register and is forwarded to
the next stage for further computation.
Fig. 3 shows the architecture of shifter. 5-to-1 and 4-to-1 multiplex-
ers generate the two partial products based on the value of the selection
function. As shown in Fig. 3, the shifter multiplies the input variable
with constants whose value are the integer power of two. Hence multi-
plication can be achieved by the fixed shift of binary bits, which does
not have any delay. The delay of the shifter is the same as the 5-to-1
multiplexer as both multiplexers are computing parallelly. 3-to-2 CSA
is implemented using one level of full adders, and hence the delay is
the same as one full adder. The total delay of the critical path is given Fig. 4. Architecture of the two-stage compensation scheme.
93
Fig. 5. 8-to-2 CSA scheme.
pensation. High precision of scale factor may require more shifters, hybrid CORDIC algorithm. The implementation of the proposed radix-
ROM to store scale factor and full adders in the compensation stage. 16 CORDIC algorithm requires 128 full adders for 16-bit precision as
High precision of X and Y vectors may require more iterations to com- compared to 256 full adders of the radix-2 CORDIC algorithm. The pro-
pute the total rotation. For example, for 20-bit precision of scale fac- posed radix-16 CORDIC algorithm has overhead in terms of a 5-to-1
tor, a 25-bit CSD coded scale factor has to be stored on ROM, and multiplexer and 5-bit comparator. For proposed radix-16 CORDIC algo-
five shifters are required to generate ten partial products. Later, 10- rithm, 9N4
× N ROM is required to store precomputed tan−1 (). The size
to-2 CSA can be used to generate sum and carry vectors by adding ten of this ROM is 36 × 16 bits for N = 16 bits, and it can be imple-
partial products. In general, for n-bit precision, total 4n + 2 iterations mented using nine slices only as one slice can store 64 bits of data. Sim-
are required to compute total rotation and to compensate for the scale ilarly, twenty-five slices are required to store the precomputed scale
factor. factor. The shift-add operation required to multiply scale factor with
The proposed radix-16 CORDIC algorithm is compared with other coordinates of the rotated vector is similar for all implementation. As a
known CORDIC algorithms. The following parameters are considered result, the implementation of the proposed radix-16 CORDIC algorithm
for comparison. 1. Iterations needed to measure the complete rotation, has less hardware compared to radix-2 CORDIC and recently proposed
2. The area in terms of full adders, 3. Critical path delay of iteration CORDIC algorithms.
and 4. Overhead in terms of additional multiplexers and comparators
required for implementation 5. ROM size required to store precom-
3. The proposed architecture of radix-2 DIF FFT algorithm
puted tan−1 () and 6. Type of scale factor (i.e., constant or variable)
and resource required to store and compensate for the scale factor.
In the case of a real-time application, which processes the contin-
The detailed comparison is shown in Table 4. Total full adders needed
uous flow of data, pipelined architecture is a better fit for such an
to implement one rotator (i.e., X or Y rotator) are given in the third
application. In proposed implementation, we have chosen the pipelined
column. For simplicity, delay due to overhead is not considered in a
architecture to increase the throughput and radix-16 CORDIC based
critical path as it would be minimal compared to the delay of carry
rotator is used to reduce the hardware resources.
propagation adder (tCPA ). As shown in Table 4, hybrid CORDIC algo-
rithm presented in Ref. [25] and proposed radix-16 CORDIC algorithm
requires a similar number of iteration to compute the total rotation but 3.1. Butterfly processing unit
proposed radix-16 CORDIC algorithm has less critical data path and it
requires 2N full adders for one rotator as compared to 3N full adders of For N-point FFT computation, log2 N stages are required. In
pipelined architecture, each stage has a butterfly unit to perform a but-
94
Table 4
Comparison of proposed radix-16 CORDIC algorithm with other proposed CORDIC algorithms.
CORDIC Parameters ROM size to Scale factor
Algorithm store tan−1 () compensation
No. of FAs∕ Total Overhead Critical Path FAs for 16-
Iterations rotator FAs delay bit precision
Radix-2 N N N2 No overhead tCPA 256 N × N Constant, shift-add
CORDIC [20]
N N2 3N
Radix-4 2
N 2
3_1 Mux + tCPA 128 2
×N Variable, ROM
N +1
CORDIC [23] 4_bit Comp. 3 4 × N, shift-add
N +3 3N (N +3)
Double Step 2
3N 2
Not reported 2tCPA 456 (N + 4) × N Constant, shift-add
Branching
CORDIC [24]
N 3 2 6N
Hybrid 4
3N 4
N 2∗ (3_1Mux + 2tCPA 192 4
×N Variable, ROM
N
CORDIC [25] 4_bitComp.) 9 16 +1 × N, shift-add
N 2N 2 5N
Radix-8 3
2N 3
4_1 Mux + tCPA + tFA 170 3
×N Variable, ROM
N +1
CORDIC [21] 5_bit Comp. 58 × N, shift-add
N N2 9N
Radix-16 4
2N 2
5_1 Mux + tCPA + tFA 128 4
×N Variable, ROM
N
CORDIC 5_bit Comp. 9 16 +1 × 5N ,
4
shift-add
3.2. Double-path delay commutator feed-forward FFT architecture
The proposed double-path delay commutator (DDC) pipelined feed-

forward architecture is optimized multi-path delay commutator (MDC)
architecture, and optimization is applied to compensate for the latency
of the proposed radix-16 CORDIC algorithm. In the proposed DDC archi-
tecture, the flow of data is continuous as each stage has a dedicated
butterfly processing unit to perform the butterfly operation. Input sam-
ples have been divided into two parallel streams by a shift register. The
shift register is used to store the intermediate data and to provide suf-
ficient delay between two streams to rearrange the data. The proposed
double-path delay commutator feed-forward architecture of 512-points
DIF FFT algorithm is shown in Fig. 7. Butterfly structure of the first
Fig. 6. The architecture of CORDIC rotator based proposed butterfly processing stage has real adders and subtractors as input is assumed to be real.
unit. If the input is not real, the butterfly structure uses complex adders and
subtractors like other stages. In conventional feed-forward architecture,
the first stage has N2 shift register before the butterfly processing unit
terfly operation. Adder, subtractor, and rotator (for twiddle multipli- and N4 shift register after the butterfly processing unit which is halved
cation) are the main components of butterfly architecture. The rotator in successive stages. We have modified this architecture to compen-
is the critical component in the butterfly engine as it requires a signifi- sate for the latency of the CORDIC algorithm. Since the CORDIC algo-
cant percentage of the total area. We propose the CORDIC based rotator rithm takes 6 clock cycles to compute sample Z2 , Z2 is delayed by six
to perform the two-dimensional vector rotation to remove the complex clock cycles as compared to Z1 . Hence, we propose ( N4 − 6) shift regis-
multiplier from the design. 16-bit fixed-point binary representation is ter after the butterfly processing unit to compensate for the latency of
used to represent the real and imaginary parts of the sample. Input to the CORDIC algorithm. As a result, 512-points FFT architecture, shown
the first stage of the FFT algorithm is assumed real, and hence, the in Fig. 7 has a 256-shift register before the butterfly processing unit
imaginary part is zero. The butterfly processing unit of the first stage is and 122-shift register after the butterfly processing unit. This concept
modified accordingly. Rotator rotates the two dimensional vector (i.e. can be extended to all stages whose shift register value after the butter-
xr + ixi ) by twiddle angle 𝜃 to multiply twiddles (i.e. cos(𝜃) + isin(𝜃)) fly processing unit is greater than six. The commutator consists of two
with two-dimensional vector. For N-point FFT computation, N2 twiddle multiplexers, as shown in Fig. 8. The commutator rearranges the data
angles are required to be stored on memory for the first stage computa- of two input streams for next stage butterfly computation. When the
tion. For each successive stage, the twiddle angles are halved. The range en input of commutator is high, it swaps the data between two input
of twiddle angle for FFT calculation is from 0 to − 𝜋 , whereas the range streams, and when en input is low, the data passes as it is. The architec-
of input angle of the proposed CORDIC algorithm is from − 𝜋2 to 𝜋2 . tures of stage-4 and stage-5 are similar to previous stages, and they are
This problem can be solved by just interchanging the value of sine and omitted from Fig. 7. In stage-6 and stage-7, the commutator needs the
cosine at the last stage of CORDIC. Fig. 6 shows the implementation of delay of 4 and 2 clock cycles respectively between two streams of data
the butterfly unit. Butterfly processing unit takes two inputs x1 and x2 to rearrange the samples on each stream correctly. The butterfly pro-
where suffix r and i represent the real and imaginary part, respectively. cessing unit generates the delay of six clock cycles between two samples
The proposed radix-16 CORDIC algorithm takes four clock cycles to Z1 and Z2 . Since stage-6 needs the delay of four clock cycles between
compute the total rotation for 16-bit precision and two additional clock two samples Z1 and Z2 , Z1 is delayed by two clock cycles, and similarly,
cycles to compensate for the scale factor. Pipelined registers are used Z1 is delayed by four clock cycles in stage-7. In stage-8, since only two
N
after complex adder/subtractor to reduce the critical path delay. Hence, twiddles i.e., WN0 = 1 and WN2 = −j are used, twiddle multiplication can
a total of seven clock cycles is used to compute (x1 − x2 ) × WNn , which be achieved by the multiplexer and adder/subtractor. As a result, the
is shown as z2 in Fig. 6. Only a complex adder is needed to compute z1 , butterfly operation of stage-8 has a delay of two clock cycles. In stage-8,
and it is directly forwarded to the next stage.
95
Fig. 7. Proposed architecture of Double-path Delay Commutator (DDC) R2DIF FFT algorithm.
mated to B adders as the implementation of one adder requires B full

adders [29]. Hence, the complex multiplier can be approximated to
3B + 4 adders. The VLSI implementation of the complex adder requires
two adders. X, Y, and Z rotators of the radix-2 CORDIC algorithm can
be implemented using one adder, and hence one stage of the radix-
2 CORDIC algorithm requires three adders. The hardware complexity
of the radix-2 CORDIC algorithm is 3B adders as it requires to exe-
cute B iterations for B-bit accuracy. One stage of the proposed radix-16
CORDIC algorithm has hardware complexity of five adders as X and
Y rotators can be implemented using two adders, and Z rotator can
be implemented using one adder. The rotation stage of the proposed
algorithm can be implemented using 5B 4
adders as it requires to exe-
B
cute 4
iterations for B-bit accuracy. The compensation stage of the
Fig. 8. Implementation of the commutator using 2-to-1 multiplexers. B
proposed radix-16 CORDIC algorithm has hardware complexity of 2
,
7B
which results in total hardware complexity of 4
.The required number
the commutator needs a delay of one clock cycle between two streams of adders for 16-bit precision is listed in the sixth column of Table 5. The
of data to rearrange them correctly for next stage computation. Since proposed implementation has less number of resources as compared to
the butterfly operation of stage-8 provides a delay of two clock cycles, other implementations with similar delay resources.
sample Z1 is delayed by one clock cycle for commutator to rearrange
the data. The butterfly processing unit of the last stage consists of only
4. Experimental results and discussion
adders and subtractors as twiddle value for the last stage is one, twiddle
multiplication is not required.
In the FFT algorithm, if the input to the first stage is assumed to
Since the proposed design is fully pipelined, the number of cycles
be real and bounded in the interval [−1, 1), then the output generated
required to process the N-point FFT is the same as latency. The latency
by the last stage of the FFT is also bounded in the interval [−N, N)
of the first stage is the 129 clock cycles as a shift register uses 128 clock
[32]. For such bounded input, the real and imaginary
√ √ parts of the inter-
cycles, and one additional clock cycle is used by pipeline register. Sim-
ilarly, the latency of the second, third, fourth, and fifth stages would nal signal are also bounded in the interval [− 2, 2) [32,33] which
be 65, 33, 17, and 9, respectively. The latency of the sixth and seventh can be represented using FXP(16,14). The N-point FFT algorithm has
stages would be 7 (including pipeline latency) as the latency of the but- an inherent gain of N that may grow the data path by log2 N bits. To
terfly processing unit is dominant in these stages compared to the shift restrict this bit growth (i.e., arithmetic overflow), the downscaling by
register. As discussed earlier, the latency of the eighth and last stage a factor of two is performed after each butterfly computation as given
would be 2 and 1, respectively. Hence, the proposed 512 points FFT in Refs. [32–34]. However, signals within the butterfly implementa-
implementation has a total latency of 270 clock cycles. In general, for tion are allowed to grow, but the final output of the butterfly stage
N−point FFT, the latency of the shifter is N−216 cycles, and the latency is truncated using 16-bit as the magnitude of the twiddle is unit. The
of the last four stages is 13 cycles irrespective of the number of points proposed architecture has been implemented for field-programmable
of the FFT. Also, each stage of the proposed design has a latency of one gate arrays (FPGAs). The Verilog HDL is used to implement the archi-
cycle due to pipeline registers resulting in a total latency of log2 N. The tecture of the proposed design. The proposed architecture exhibits the
number of cycles required to process the N−point FFT can be expressed maximum modularity and regularity in the design. The functional ver-
as below, ification of the proposed design is carried out using the Xilinx Vivado
Design Suite. The proposed design is synthesized and implemented for
N − 16 Virtex-7 series FPGA for an operating frequency of 200 MHz.
Ncycles = + log2 N + 13 (12)
2 Implementation of the complex multiplier needs three real multipli-
Table 5 shows the comparison of the proposed architecture to other ers and four real adders [31], which acquires the significant percentage
architectures for the computation of an N-point FFT in terms of required of the total area of the butterfly processing unit and consumes more
arithmetic and delay resources and latency. For comparison purposes, power. This area is reduced by replacing the complex multiplier with
the complex multiplier, complex adder, and CORDIC based rotator have our proposed radix-16 CORDIC algorithm based rotator. To carry out
been approximated with their equivalent adders for B-bit accuracy and the twiddle multiplication, our proposed rotator uses the twiddle angles
listed in the fifth column of Table 5. The VLSI implementation of the only. Hence, only twiddle angles are required to be stored on the mem-
complex multiplier requires three real multipliers and four real adders ory, which reduces the size of a ROM memory as compared to conven-
[31]. The implementation of B × B array multipliers needs B2 AND tional FFT implementation designed with the complex multiplier. The
gates, B2 half adders, and B(B − 2) full adders and can be approxi- Xilinx IPs of complex multiplier [31] and CORDIC based rotator [35]
96
Table 5
Comparison of the proposed architecture to other architectures for the computation of an N-point FFT in terms of required arithmetic and
delay resources, and latency.
FFT Algorithm Arithmetic Resources (n = log2 N) Total Adder for Delay Resources Latency
16-bit accuracy
Complex Complex CORDIC Total Adder for
Multiplier Adder Rotator B-bit accuracy
N
R2MDC [11] n − 1 2n 0 3B(n − 1) + 8n − 4 56n − 52 N 2
3N 11N
R2MDC [14] n − 2 2n 0 3B(n − 2) + 8n − 8 56n − 104 2
−2 8
−4
R2M2 DF [13] n − 2 2n 0 3B(n − 2) + 8n − 8 56n − 104 N − 2 N
N
R2DDC [12] 0 2n n − 2 3B(n − 2) + 4n 56n − 48 N 2
+n−2
7B N −16
Proposed, R2M• DC 0 2n n − 2 4
(n − 2) + 4n 32n − 56 N − 6n + 20 2
+n+ 13
Table 6
Comparison of resource utilization of proposed radix-16 CORDIC based rotator
with Xilinx IP core of radix-2 CORDIC rotator and complex multiplier.
Design Hardware Resource Utilization Latency
Slice Slice Total
LUTs registers Slices
Xilinx ComplexMultiplier IP V.6 1019 940 1959 5
Xilinx CORDICIP V.6 1146 1201 2347 16
Proposed radix-16CORDIC based rotator 954 336 1290 7
Table 7
Performance comparison of the proposed CORDIC rotator based R2DIF FFT
algorithm and conventional memory-based architectures.
FFT Points Conventional design Proposed design % Improvement in Performance
Slice Latency Slice Latency Slice Latency
512 14,844 520 11,302 270 23.86 48.08
1024 20,195 1033 13,449 527 33.4 48.98
2048 24,827 2058 16,273 1040 34.45 49.47
4096 34,383 4107 20,615 2065 40.04 49.72
Average performance improvement 32.96 49.06
are implemented for 16-bit fixed-point precision for real and imaginary Table 8
value, and it is synthesized for Virtex-7 series FPGA for the compar- SNR comparison of the proposed design and Matlab implementation.
ison purpose. Table 6 shows the resource utilization of our proposed
FFT Point SNR in dB PSNR in dB
radix-16 CORDIC based rotator and Xilinx IPs of complex multiplier
Proposed Implementation Matlab Implementation
and CORDIC based rotator. Table 6 suggests that our proposed radix-16
CORDIC based rotator has 34% less hardware utilization as compared to 512 40.18 41.22 62.65
1024 45.16 49.57 60.34
Xilinx IPs of complex multiplier [31] and 45% as compared to CORDIC 2048 37.45 41.09 59.06
rotator [35]. Comparison between the proposed design and conven- 4096 44.92 49.52 57.53
tional fully pipelined DDC implementation [12] of variable length FFT
algorithm is presented in Table 7 considering two parameters, resource
utilization in terms of slices used and latency in terms of clock cycles
of our proposed architecture is measured and compared with the results
used to compute the FFT of an input sequence. The conventional fully
obtained using the Matlab function fft in double-precision floating-point
pipelined DDC architecture presented in Ref. [12], has used LUTs to
format. First, the input signal, xt = 0.5 × (sin(2𝜋 f1 t) + sin(2𝜋 f2 t)) is
store the complex twiddles. It has used the complex multiplier, imple-
sampled with sampling rate fs where, f1 = 0.2 MHz, f2 = 0.3 MHz,
mented using CLBs, to carry out the twiddle multiplications. Table 7
and fs = 1 MHz. Later, the sampled analog input is converted to digital
suggests that our proposed implementation uses an average of 32.96%
in FXP(16,14) fixed-point representation and applied to our proposed
fewer slices as compared to conventional pipelined DDC architecture.
architecture. The output of the proposed architecture is brought back
Reduction in slices is possible as our proposed radix-16 CORDIC based
to Matlab to measure the SNR. The SNR is defined as the ratio of the
rotator has less hardware utilization as compared to the complex multi-
power of the input signal to the sum of the power of all frequency com-
plier, as shown in Table 6. The number of stages in FFT design depends
ponents (noise) in the output which are not the part of the input signal.
on the number of points of the FFT, and each stage has one rotator
The peak signal-to-noise ratio (PSNR) is also measured considering the
to perform the complex multiplication. Hence, if the size of the FFT
Matlab output as reference. The peak signal-to-noise ratio (PSNR) is
increases, percentage saving in the slice increases, which can be seen
defined as the ratio of a maximum power of input signal to mean of the
in Table 7. As shown in Table 7, the proposed implementation has 49%
sum over all squared differences.
less computational latency compared to conventional pipelined DDC
( )
architecture as our proposed design processes two data samples in one A2max
clock cycle. For accuracy measurement, the signal-to-noise ratio (SNR) PSNR = 10 log 1 ∑N (13)
2
N i=1 (Ai − Bi )
97
Fig. 9. The output of the proposed architecture and Matlab implementation.
where Amax is the maximum amplitude of the input, Ai is the output are also added in Table 9. As the order of the input and output sam-
of the proposed architecture and Bi is the output of the Matlab func- ples depend on different applications, hardware required to order the
tion fft. Table 8 shows the comparison of the proposed architecture data before the FFT and after the FFT is not considered for compar-
with the Matlab implementation for different length of the FFT. For ison purposes. Many architectures use the dedicated functional block
4096 point FFT, the loss in SNR is 4.6 dB and 1.04 dB for 512 points of the FPGA, i.e., DSP and block RAM. For comparison purposes, these
FFT due to quantization and truncation error. Fig. 9 shows the output functional blocks are converted into their equivalent slices. One block
of the proposed architecture and Matlab implementation. The PSNR RAM is approximated to the 550 slices as one slice can be configured to
of the proposed design is given in Table 8.The detailed comparison store 64-bits of data. One DSP block can be approximated to 385 slices
between our proposed pipelined architecture and other pipelined archi- as the height of the DSP block matches the one block RAM, and each
tectures is presented in Table 9 considering these parameters: input DSP block aligns horizontally with an 18K block RAM. Architectures,
data width, resource utilization, latency, throughput, and operating presented in Refs. [9,13–15,34], use the complex multiplier to imple-
frequency. The length of the FFT is considered for the classification ment the butterfly processing unit. The butterfly processing unit of the
of the FFT architecture. The resources used to implement the butter- architecture presented in Refs. [4,12] is implemented using a radix-2
fly operation and architecture used to implement the FFT algorithm CORDIC algorithm based rotator. The butterfly processing unit of our
Table 9
Performance comparison of the proposed R2DIF FFT implementation with other implementation.
FFT implementation strategy FPGA Resources Performance

Points Algorithm Architecture Data width Butterfly LUTs DSP Block RAM TotalLUTs Latency Throughput Operating
resources (cycles) samples/cycle Frequency
512 [14]b R2,FP,MDC 16-bit FXP CA,CM 3992 21 6 15,377 718 2 312 MHz
[13]b R2,PP,M2 DF 16-bit FXP CA,CM 3681 21 5 14,516 530 2 305 MHz
[4]a R2,FP,FF 8-bit FXP R2CORDIC 12,478 0 0 12,478 892 2 199 MHz
[34]c R2,PSI 16-bit FXP CA,CM 15,469 0 1.5 16,294 1137 1 245 MHz
[12]c R2,FP,DDC 16-bit FXP R2CORDIC# 12,109 0 0 12,109 263 2 200 MHz
Proposedc R2,FP,MDC• 16-bit FXP R16CORDIC 11,302 0 0 11,302 270 2 230 MHz
[4]a R2,FP,FF 8-bit FXP R2CORDIC 20,873 0 0 20,873 1682 2 186 MHz
[34]c R2,PSI 16-bit FXP CA,CM 16,098 0 2 17,198 2167 1 245 MHz
[7]d R4,PP,MDC 16-bit FXP CA,CM 1425 48 0 19,905 285 4 270 MHz
4096 [15]d R2,FP,SDC-SDF 16-bit FXP CA,CM 11,171 20 4 21,071 8204 1 295 MHz
R2: Radix-2 FFT, FP: Fully pipelined, PP: Parallel Pipelined, CA: Complex Adder, CM: Complex Multiplier, a : Virtex-2, b : Virtex-6.
c
: Virtex-7, d : Virtex-5, FF:Feed Forward, MDC: Multipath Delay Commutator, M2 DF: Mixed-decimation multipath Delay Feedback.
MDC• :Proposed optimized MDC architecture to compensate the latency of R16CORDIC, PSI: Pipelined Streaming I/Os.
SDC-SDF: Single-path Delay Commutator-Feedback, DDC: Double-path Delay Commutator, # : Optimized radix-2 CORDIC.
98
lization as compared to a complex multiplier based Xilinx FFT IP core.

Also, the proposed design has an average 49% less latency and 33%
less hardware utilization as compared to conventional pipelined DDC
implementation.
CRediT authorship contribution statement
Ankur Changela: Conceptualization, Methodology, Investigation,

Writing - original draft. Mazad Zaveri: Validation, Formal analysis,
Writing - review & editing. Deepak Verma: Validation, Resources,
Writing - review & editing.
Declaration of competing interest
Fig. 10. The comparison between our proposed FFT implementation with radix- The authors declare that they have no known competing financial
2 CORDIC based FFT implementation and Xilinx FFT IP core. interests or personal relationships that could have appeared to influence
the work reported in this paper.
proposed architecture is implemented using the radix-16 CORDIC algo- References

rithm based rotator, and twiddle angles are stored using slice registers
[1] M. Garrido, K.K. Parhi, J. Grajal, A pipelined FFT architecture for real-valued
and memory LUTs rather than complex block RAMs. From Table 9, it signals, IEEE Trans. Circ. Syst. I: Reg. Pap. 56 (12) (2009) 2634–2643, https://doi.
can be seen that our proposed implementation needs fewer slices as org/10.1109/TCSI.2009.2017125.
compared to other pipelined architectures. In pipelined architecture, [2] V. Iglesias, J. Grajal, M.A. Snchez, M. Lpez-Vallejo, Implementation of a real-time
spectrum analyzer on FPGA platforms, IEEE Trans. Instrumen. Measur. 64 (2)
each iteration of the CORDIC algorithm has dedicated hardware for (2015) 338–355, https://doi.org/10.1109/TIM.2014.2344411.
computation. The proposed radix-16 CORDIC algorithm needs N4 iter- [3] J.S. Bruno, V. Almenar, J. Valls-Coquillat, FPGA implementation of a 10 gs/s
ations to calculate the complete rotation as compared to N iterations variable-length FFT for OFDM-based optical communication systems, Microproc.
Microsyst. Embed. Hardware Des. 64 (2019) 195–204, https://doi.org/10.1016/j.
of radix-2 CORDIC. As a result, the proposed radix-16 CORDIC algo- micpro.2018.12.002.
rithm has less hardware utilization compared to the radix-2 CORDIC [4] M.A. Sanchez, M. Garrido, M. Lopez-Vallejo, J. Grajal, Implementing FFT-based
algorithm. Fig. 10 shows the comparison between our proposed FFT digital channelized receivers on FPGA platforms, IEEE Trans. Aero. Electron. Syst.
44 (4) (2008) 1567–1585, https://doi.org/10.1109/TAES.2008.4667732.
implementation with radix-2 CORDIC based FFT implementation [12]
[5] H. Huang, L. Xiao, CORDIC based fast radix-2 DCT algorithm, IEEE Signal Process.
and Xilinx FFT IP core [34] considering the resource utilization. From Lett. 20 (5) (2013) 483–486, https://doi.org/10.1109/LSP.2013.2252616.
Fig. 10, it can be observed that, as the number of points of the FFT [6] L. Yang, K. Zhang, H. Liu, J. Huang, S. Huang, An efficient locally pipelined FFT
processor, IEEE Trans. Circ. Syst. II: Express Briefs 53 (7) (2006) 585–589, https://
increases, the percentage improvement in slice utilization is better in
doi.org/10.1109/TCSII.2006.875306.
proposed implementation as compared to radix-2 CORDIC based rota- [7] M. Garrido, J. Grajal, M.A. Sanchez, O. Gustafsson, Pipelined radix-2kfeedforward
tor implementation. From the comparison shown in Table 9, proposed FFT architectures, IEEE Trans. Very Large Scale Integr. Syst. 21 (1) (2013) 23–32,
FFT implementation has an average 17.5% less resource utilization as https://doi.org/10.1109/TVLSI.2011.2178275.
[8] J. Cooley, J. Tukey, An algorithm for the machine calculation of complex fourier
compared to the radix-2 CORDIC rotator based implementation [12] series, Math. Comput. 19 (90) (1965) 297–301.
and 30% less resource utilization as compared to Xilinx FFT IP core [9] Z. Ma, X. Yin, F. Yu, A novel memory-based FFT architecture for real-valued
[34]. The proposed implementation is operated at a 200 MHz clock signals based on a radix-2 decimation-in-frequency algorithm, IEEE Trans. Circ.
Syst. II: Express Briefs 62 (9) (2015) 876–880, https://doi.org/10.1109/TCSII.
rate, and it gives two samples per clock cycle, achieving the throughput 2015.2435522.
of 400Msps (Mega samples per second). [10] S. Zhou, X. Wang, J. Ji, Y. Wang, Design and implementation of a 1024-point
high-speed FFT processor based on the FPGA, in: 2013 6th International Congress
on Image and Signal Processing (CISP), vol. 2, 2013, pp. 1112–1116, https://doi.
5. Conclusion org/10.1109/CISP.2013.6745222.
[11] C. Cheng, K.K. Parhi, High-throughput VLSI architecture for FFT computation,
IEEE Trans. Circ. Syst. II: Express Briefs 54 (10) (2007) 863–867, https://doi.org/
In many real-time applications, high-performance FFT is required 10.1109/TCSII.2007.901635.
by DSP systems, such as OFDM. In this paper, the high-performance [12] N.H. Nguyen, S.A. Khan, C.-H. Kim, J.-M. Kim, A high-performance,
radix-2 DIF FFT algorithm is presented, and pipelined architecture is resource-efficient, reconfigurable parallel-pipelined FFT processor for FPGA
platforms, Microprocess. Microsyst. 60 (2018) 96–106, https://doi.org/10.1016/j.
implemented on a Xilinx Virtex-7 series FPGA. The multi-path delay micpro.2018.04.003.
commutator architecture is optimized to compensate for the latency of [13] J. Wang, C. Xiong, K. Zhang, J. Wei, A mixed-decimation MDF architecture for
the proposed radix-16 CORDIC algorithm. The need for a complex mul- radix-2kparallel FFT, IEEE Trans. Very Large Scale Integr. Syst. 24 (1) (2016)
67–78, https://doi.org/10.1109/TVLSI.2015.2402207.
tiplier to perform complex multiplication is substituted with the pro-
[14] M. Ayinala, M. Brown, K.K. Parhi, Pipelined parallel FFT architectures via folding
posed novel low latency radix-16 CORDIC rotator to improve resource transformation, IEEE Trans. Very Large Scale Integr. Syst. 20 (6) (2012)
utilization. The proposed radix-16 CORDIC algorithm is implemented 1068–1081, https://doi.org/10.1109/TVLSI.2011.2147338.
[15] Z. Wang, X. Liu, B. He, F. Yu, A combined SDC-SDF architecture for normal i/o
with fully pipelined architecture to increase the throughput and carry-
pipelined radix-2 FFT, IEEE Trans. Very Large Scale Integr. Syst. 23 (5) (2015)
save (redundant) arithmetic is used to manage the condition of the com- 973–977, https://doi.org/10.1109/TVLSI.2014.2319335.
plex selection function. The proposed implementation has less hardware [16] B. Gold, T. Bially, Parallelism in fast Fourier transform hardware, IEEE Trans.
utilization as it requires N4 stages (one stage for each iteration) to calcu- Audio Electroacoust. 21 (1) (1973) 5–16, https://doi.org/10.1109/TAU.1973.
1162428.
late total rotation for N-bit precision. The canonical signed digit (CSD) [17] E.E. Swartzlander, W.K.W. Young, S.J. Joseph, A radix 4 delay commutator for fast
technique is used to limit the partial products for scale factor compen- Fourier transform processor implementation, IEEE J. Solid State Circ. 19 (5)
sation, which also improves the hardware utilization. The performance (1984) 702–709, https://doi.org/10.1109/JSSC.1984.1052211.
[18] A.M. Despain, Fourier transform computers using CORDIC iterations, IEEE Trans.
of our proposed variable-length FFT implementation is evaluated with Comp. C 23 (10) (1974) 993–1001, https://doi.org/10.1109/T-C.1974.223800.
other implementations considering throughput, hardware utilization, [19] S. Li, H. Xu, W. Fan, Y. Chen, X. Zeng, A 128/256-point pipeline FFT/IFFT
and latency as parameters. Experimental results suggest that the pro- processor for MIMO OFDM system IEEE 802.16e, in: Proceedings of 2010 IEEE
International Symposium on Circuits and Systems, 2010, pp. 1488–1491, https://
posed design is efficient in slices used (area) as our proposed imple- doi.org/10.1109/ISCAS.2010.5537355.
mentation has an average 17.5% less hardware utilization as compared [20] J.E. Volder, The CORDIC trigonometric computing technique, IRE Trans. Electr.
to radix-2 CORDIC based implementation and 30% less hardware uti- Comp. EC 8 (3) (1959) 330–334, https://doi.org/10.1109/TEC.1959.5222693.
99
[21] A. Changela, M. Zaveri, A. Lakhlani, ASIC implementation of high performance [28] M.D. Ercegovac, J. Muller, Digit-recurrence algorithms for division and square root
radix-8 CORDIC algorithm, in: 2018 International Conference on Advances in with limited precision primitives, in: The Thrity-Seventh Asilomar Conference on
Computing, Communications and Informatics (ICACCI), 2018, pp. 699–705, Signals, Systems Computers, 2003, vol. 2, 2003, pp. 1440–1444, https://doi.org/
https://doi.org/10.1109/ICACCI.2018.8554883. 10.1109/ACSSC.2003.1292224.
[22] A. Changela, M. Zaveri, A. Lakhlani, FPGA implementation of asynchronous [29] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford
mousetrap pipelined radix-2 CORDIC algorithm, in: 2018 International Conference University Press, Oxford, UK, 2000.
on Current Trends towards Converging Technologies (ICCTCT), 2018, pp. [30] S.-M. Kim, J.-G. Chung, K.K. Parhi, Low error fixed-width CSD multiplier with
252–258, https://doi.org/10.1109/ICCTCT.2018.8551112. efficient sign extension, IEEE Trans. Circ. Syst. II: Anal. Dig. Sign. Proc. 50 (12)
[23] E. Antelo, J. Villalba, J.D. Bruguera, E.L. Zapata, High performance rotation (2003) 984–993, https://doi.org/10.1109/TCSII.2003.820231.
architectures based on the radix-4 CORDIC algorithm, IEEE Trans. Comput. 46 (8) [31] I. Xilinx, LogiCORE IP Complex Multiplier v6.0, Product Specifications PG104,
(1997) 855–870, https://doi.org/10.1109/12.609275. 2015. November 18th.
[24] D.S. Phatak, Double step branching CORDIC: a new algorithm for fast sine and [32] J.G. Proakis, D.K. Manolakis, Digital Signal Processing, fourth ed., Prentice-Hall,
cosine generation, IEEE Trans. Comput. 47 (5) (1998) 587–602, https://doi.org/ Inc., USA, 2006.
10.1109/12.677251. [33] C. Ingemarsson, O. Gustafsson, SFF-the single-stream FPGA-optimized feedforward
[25] R. Shukla, K.C. Ray, Low latency hybrid CORDIC algorithm, IEEE Trans. Comput. FFT hardware architecture, J. Sign. Process. Syst. 90 (11) (2018) 1583–1592,
63 (12) (2014) 3066–3078, https://doi.org/10.1109/TC.2013.173. https://doi.org/10.1007/s11265-018-1370-y.
[26] Y.-N. Chang, An efficient VLSI architecture for normal i/o order pipeline FFT [34] I. Xilinx, LogiCORE IP Fast Fourier Transform v8.0, Product Specifications DS808,
design, IEEE Trans. Circ. Syst. II: Express Briefs 55 (2008) 1234–1238. 2012. July 25th.
[27] P.K. Meher, J. Valls, T.B. Juang, K. Sridharan, K. Maharatna, 50 years of CORDIC: [35] I. Xilinx, LogiCORE IP CORDIC v6.0, Product Specifications IPPG105, 2017.
algorithms, architectures, and applications, IEEE Trans. Circ. Syst. I: Reg. Pap. 56 December 20th.
(9) (2009) 1893–1907, https://doi.org/10.1109/TCSI.2009.2025803.
100

Integration, The VLSI Journal: Ankur Changela, Mazad Zaveri, Deepak Verma

Uploaded by

Copyright:

Available Formats

Integration, The VLSI Journal: Ankur Changela, Mazad Zaveri, Deepak Verma

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Integration, The VLSI Journal: Ankur Changela, Mazad Zaveri, Deepak Verma

Uploaded by

Copyright:

Available Formats

Integration, the VLSI Journal 73 (2020) 89–100

Contents lists available at ScienceDirect

Integration, the VLSI Journal

FPGA implementation of high-performance, resource-eﬃcient Radix-16

Fig. 1. Overlapping area between the intervals.

vector with a nibble, four partial products have to be generated. To Table 3

Fig. 3. Architecture of shifter.

TCP = TComp. + TMux + TFA + TCPA (10)

Fig. 4 shows the architecture of the two-stage compensation scheme.

Fig. 5. 8-to-2 CSA scheme.

3.2. Double-path delay commutator feed-forward FFT architecture

The proposed double-path delay commutator (DDC) pipelined feed-

mated to B adders as the implementation of one adder requires B full

Average performance improvement 32.96 49.06

Fig. 9. The output of the proposed architecture and Matlab implementation.

FFT implementation strategy FPGA Resources Performance

lization as compared to a complex multiplier based Xilinx FFT IP core.

CRediT authorship contribution statement

Ankur Changela: Conceptualization, Methodology, Investigation,

Declaration of competing interest

proposed architecture is implemented using the radix-16 CORDIC algo- References

You might also like