Integration, The VLSI Journal: Ankur Changela, Mazad Zaveri, Deepak Verma
Integration, The VLSI Journal: Ankur Changela, Mazad Zaveri, Deepak Verma
Integration, The VLSI Journal: Ankur Changela, Mazad Zaveri, Deepak Verma
A R T I C L E I N F O A B S T R A C T
Keywords: The fast Fourier transform (FFT) is an algorithm widely used to compute the discrete Fourier transform (DFT) in
CORDIC algorithm real-time digital signal processing. High-performance with fewer resources is highly desirable for any real-time
FFT
application. Our proposed work presents the implementation of the radix-2 decimation-in-frequency (R2DIF)
Double-path delay commutator (DDC)
FFT algorithm based on the modified feed-forward double-path delay commutator (DDC) architecture on FPGA
Canonical signed digit (CSD)
Redundant arithmetic
device. Need for a complex multiplier to carry out the multiplication of complex twiddle factors and large mem-
FPGA ory to store the twiddle factors are the main concerns for FFT implementation. Propose work aims to address
these issues. In this work, a high-performance radix-16 COordinate Rotational DIgital Computer (CORDIC) algo-
rithm based rotator is proposed to carry out the complex twiddle factor multiplication. Further, CORDIC needs
only rotational angles to carry out complex multiplication, which reduces the need for large memory to store
the twiddle factors. To compute the total rotation for n-bit precision, our proposed radix-16 CORDIC algorithm
takes n∕4 iteration as compared to n iteration of the radix-2 CORDIC algorithm. Our proposed architecture
of the radix-2 decimation-in-frequency (R2DIF) algorithm is implemented on a Virtex−7 series FPGA. Further,
the detailed comparison is presented between our proposed FFT implementation and other recently proposed
FFT implementations. Experimental results suggest that proposed implementation has less latency and hardware
utilization as compared to recently proposed implementations.
1. Introduction of the FFT algorithm is very low (O(Nlog(N))), it is suitable for the VLSI
implementation.
The discrete Fourier transform (DFT) performs a crucial role in the Modern real-time applications demand high-throughput, low-
field of digital signal processing [1–3]. DFT is used to design, analyze, latency, and less hardware utilization for efficient FFT implementation
and implement many digital signal processing (DSP) algorithms and [12]. To meet these demands, pipelined architectures [1,6,7,13–15] are
systems [2,4,5]. DFT has very high computation complexity (O(N2 )) widely used by hardware designers. The pipelined architecture allows
[6,7] and its VLSI implementation is not practical. FFT is an algorithm all stages to compute in parallel, which results in high-throughput and
to compute the N-point DFT efficiently. FFT algorithm given by Cooly- low-latency. Feed-forward (FF) [4,7,11,12,16,17] and feed-back (FB)
Tukey is mostly used to compute the N-point DFT when N is the integer [4,6,18,19] are the two main types of pipeline architectures. In feed-
power of two [8]. DIT (decimation-in-time) and DIF (decimation-in- forward architecture, data flows from the current stage to the next stage
frequency) are two most widely used FFT algorithms to compute the continuously, whereas few outputs of the butterfly stage are fed back
DFT of an N-point input sequence. The FFT algorithm decreases the to the memories at the same stage in feed-back architecture. Memory-
computational complexity of DFT to O(Nlog(N)) by decomposing the based iterative architectures have low resource requirements as com-
computation of DFT of a sequence of length N into successively smaller pared to pipelined architectures, but they have a problem of large
DFT and taking advantage of the periodicity and symmetry properties of latency [9,12].
the complex twiddle factor [9–11]. Since the computational complexity In this work, high-performance, and resource-efficient, FPGA imple-
mentation of the radix-2 DIF FFT algorithm is proposed. To improve the
∗ Corresponding author.
E-mail addresses: ankurchangela.ec@indusuni.ac.in (A. Changela), zaverimazad@gmail.com (M. Zaveri), deepak.verma@ahduni.edu.in (D. Verma).
https://doi.org/10.1016/j.vlsi.2020.03.008
Received 17 November 2019; Received in revised form 20 February 2020; Accepted 31 March 2020
Available online 16 April 2020
0167-9260/© 2020 Elsevier B.V. All rights reserved.
A. Changela et al. Integration, the VLSI Journal 73 (2020) 89–100
performance, our proposed radix-16 CORDIC algorithm based rotator sented in section 4. Section 5 concludes the proposed work.
and modified parallel pipelined double-path delay commutator archi-
tecture are used. The main component of the FPGA implementation of 2. The proposed radix-16 CORDIC algorithm
the FFT algorithm is a butterfly engine that uses two complex adders
and one complex multiplier. Further, four real adders and three real In a CORDIC algorithm, a rotating vector is rotated in a two-
multipliers are required to implement the complex multiplier, and the dimensional plane to achieve the total rotation for the specified angle.
complex adder can be implemented using two real adders. Proposed The given angle is divided into a series of random angles. However,
FFT implementation aims to reduce the arithmetic resources by using these arbitrary angles are chosen in such a way that the rotation of a
our proposed low-latency, high-performance radix-16 CORDIC algo- vector can be calculated by simple binary add and shift operations. In
rithm based rotator. CORDIC is a rotational algorithm that unites the the CORDIC algorithm, vector rotation is performed along with three
theory of rotational geometry in a two-dimensional plane along with coordinate systems: circular, hyperbolic, and linear. CORDIC processor
three coordinate systems [20–22]. The proposed radix-16 CORDIC rota- can be configured to evaluate the trigonometric and hyperbolic func-
tor is configured in circular rotation mode to multiply the complex tions by selecting a suitable mode of operation and coordinate system
twiddles with samples, which removes the need for the complex mul- [22,27].
tiplier. The radix-2 CORDIC algorithm has a drawback of high latency Radix of the CORDIC algorithm indicates the number of bits it can
as it takes 5n
4
iterations for n-bit precision to compute the total rota- process in one iteration. Radix-2n CORDIC algorithm can process n-bit
tion with scale-factor compensation [20]. The radix-4 CORDIC algo- in a single iteration. To speed up the computation, higher radix CORDIC
rithm has decreased the iterations to 2n to calculate the complete rota- may be employed to process more data bits in a single iteration, which
tion without scale factor compensation [23]. The double-step branch- also results in a reduction of latency. Radix-16 CORDIC algorithm can
ing CORDIC algorithm presented in Ref. [24] has decreased the total process 4-bits of data in one iteration. For rotation mode, we propose
iterations to n+2 3 to calculate the complete rotation. Hybrid CORDIC the following radix-16 CORDIC algorithm iteration.
algorithm reported that in Ref. [25] needs 3n + 1 iterations to mea-
8 xi+1 = xi − 𝜎i 16−i yi
sure the total rotation with scale factor compensation, but hardware
complexity is very high. Recently reported radix-8 CORDIC algorithm yi+1 = yi + 𝜎i 16−i xi
requires 3n iterations to evaluate the complete rotation and presented
the trade-off between hardware complexity and iterations required to zi+1 = zi − tan−1 (𝜎i 16−i ) (1)
achieve a certain precision [21]. Our proposed radix-16 CORDIC algo- ∏√
rithm is implemented with simple hardware as its VLSI implementation K= 1 + 𝜎i2 16−2i
needs only adders and shifters. Further, the radix-16 CORDIC algorithm i≥0
needs 4n iterations to find out the total rotation with two supplementary In radix-16 CORDIC algorithm, the rotation angle is divided into a series
iterations to compensate the scale factor for n-bit accuracy as compared of basic angles, i.e. tan−1 (𝜎 i 16−i ) where 𝜎 i represents the selection func-
to n iterations of the radix-2 CORDIC algorithm. The selection function tion. The selection function of radix-16 CORDIC algorithm can have any
of the radix-16 CORDIC algorithm is complex as it is not an integer value from 17 values i.e. 𝜎 i ∈ {−8, −7, … , 0, … , 7, 8} as compared to
power of two. This problem can be solved by adding the multiplexer in simple selection function of radix-2 CORDIC algorithm i.e. 𝜎 i ∈ {−1, 1}.
the data path. Radix-16 CORDIC rotator needs only twiddle angles to The problem of complex selection function can be solved by adding
multiply the twiddle factors with samples, which removes the need for a 4-to-1 multiplexer in the data path and using redundant arithmetic
large memory storage. The canonical signed digit (CSD) representation (carry-save). Detail of this implementation is discussed in the section
is used to represent the precomputed twiddle factors, which reduce the on architecture. From (1), it can be seen that the scale factor is not
partial product and hardware needed to add the partial products. a constant, and its value ranges from 1 to 8.81 and depends on two
Many researchers have tried to reduce the hardware by either reduc- parameters, iteration being executed and selection function.
ing the arithmetic resources or by increasing the utilization of arith-
metic resources. M. Ayinala et al. have reduced the area and power
2.1. A proof of convergence in rotation mode
by using register minimization technique and folding transformation
[14]. J. Wang et al. have presented a mixed-decimation multi-path
For the particular value of selection function 𝜎 i , we have to show
delay feed-back approach for the radix-2k FFT in Ref. [13]. They have
that variable z is limited in each iteration to prove the convergence
integrated the DIT operations into DIF operated computing units. Using
of the proposed CORDIC algorithm. To verify the convergence, the ini-
the concept of folding transformation, they have increased the arith-
tial value of the input angle z0 is assumed between − 𝜋∕2 to 𝜋∕2 i.e.
metic resources utilization and reduced the computing delay at the cost
z0 ∈ [−𝜋∕2, 𝜋∕2]. The theory of SRT division showed in Refs. [28,29]
of complex control logic. M. Sanchez has presented a detailed study
is applied to conclude the criteria for the selection function 𝜎 i . Accord-
of the FFT architectures and implemented the simplified monobit FFT
ing to a method presented in Refs. [28,29], we define a variable wi as
algorithm to maximize the throughput and to minimize the area [4].
follows:
Z. Wang et al. have implemented the combined single delay commuta-
tor (SDC), and single delay feed-back (SDF) pipelined FFT processor to wi = 16i zi
reduce the complex arithmetic resources but had a disadvantage of low-
To transform the variable zi+1 given in (1) into variable wi+1 , zi+1 is
throughput and high-latency [15]. Many researchers have presented a
multiplied with 16i+1 and it is rearranged as follows:
higher radix FFT algorithm. However, the VLSI implementation of these
algorithms needs complex hardware [6,7,26]. 16i+1 zi+1 = 16i+1 zi − 16i+1 tan−1 (𝜎i 16−i )
The rest of the paper has been organized as follows: Section 2 is
on the proposed high-performance, low-latency radix-16 CORDIC algo- wi+1 = 16(wi − Pi [𝜎i ]) (2)
rithm. The proof of convergence, scale factor compensation, calcula-
Pi [𝜎i ] = 16i tan−1 (𝜎i 16−i )
tions of selection function, and fully pipelined unrolled architecture of
the radix-16 CORDIC algorithm are also presented in section 2. The The recursive equation of wi , given in (2), is similar to the recursive
details of the FPGA implementation of the proposed radix-2 DIF FFT equation of the partial remainder of the high radix SRT division. For
algorithm is presented in section 3. Simulation results and compari- convergence of the radix-16 CORDIC algorithm, it is necessary that after
son of proposed implementation with existing implementations are pre- each iteration, the variable wi+1 is bounded in the interval. The method
for the radix-8 SRT division presented in Ref. [21] is used to derive the
90
A. Changela et al. Integration, the VLSI Journal 73 (2020) 89–100
method for the radix-16 SRT division. Two bound values, lower bound Table 1
M and upper bound N, are defined as follows: Decomposition of complex selection
function.
Mi [q] = Pi [q] − (8∕15)Pi [1]
Selection Generated partial products
Ni [q] = Pi [q] + (8∕15)Pi [1] (3) Function Multiplexer-1 Multiplexer-2
where q ∈ {±8, ±7, ±6, ±5, ±4, ±3, ±2, ±1, 0}. In each iteration, there ±8 ±8xi 0
±7 ±8xi ∓xi
must be a selection function, 𝜎 i = q, for which wi is limited in an
±6 ±4xi ±2xi
interval as follows: ±5 ±4xi ±xi
±4 ±4xi 0
Mi [q] ≤ wi ≤ Ni [q] (4) ±3 ±2xi ±xi
±2 ±2xi 0
Interval, given in (4), must be valid for at least one value of 𝜎 i = q. ±1 ±xi 0
For the algorithm to be converged, continuity of interval given in (4) 0 0 0
has to be verified. This verification is carried out for all iterations using
Matlab. To prove the convergence of the algorithm, 𝜎 i has to be selected
such a way that wi is bounded in interval given by
The selection function for i > 1 can be made independent of iter-
|wi | ≤ Pi [8] + (8∕15)Pi [1] (5) ation being executed. To achieve this condition, the maximum value
of the lower limit (M) and the minimum value of the upper limit (N)
The method of induction is used to prove the convergence of the radix-
are chosen as this makes overlapping between intervals independent
16 CORDIC algorithm. The proof is carried out in two parts. In the first
of the iteration index. Since limits M and N are monotonically increas-
part, the algorithm is verified for i = 0, and later algorithm is verified
ing function, the maximum value of the lower limit can be obtained
for i ≥ 1. Assuming the initial value of the input angle |w0 | ≤ 𝜋∕2,
when i = ∞ i.e., M∞ [q] and minimum value of the upper limit can be
from (5),
obtained when i = 2 i.e., N2 [q].
|w0 | ≤ P0 [8] + (8∕15)P0 [1] ≈ 1.85 The overlapping area between the intervals is shown in Fig. 1. The
criteria for selecting the selection function is derived separately for
Hence (5) is verified for i = 0. To prove the convergence for i ≥ 1, i = 0, i = 1, and i > 1 as given in Table 2. However, the case when
assuming that (5) is verified for i = n − 1 then (5) has to be proved i > 1 is discussed here. Selection criteria for i = 0 and i = 1 can be
valid for i = n. The criteria given in (4) can be rewritten for i = n − 1 obtained in a similar way. To find out the selection criteria for 𝜎 i = 1,
as below, first intervals for selecting 𝜎 i = 1 and 𝜎 i = 2 have to be found out. As
shown in Fig. 1, the interval for selecting 𝜎 i = 1 is 0.46 ≤ wi < 1.53
Mn−1 [q] ≤ wn−1 ≤ Nn−1 [q]
and interval for selecting 𝜎 i = 2 is 1.46 ≤ wi < 2.53. Hence for select-
To obtain the inequality given in (6), first M and N are replaced with ing 𝜎 i = 1, the value of wi has to be chosen from the overlapping
their values given in (3). Later expression is subtracted by Pi [q] and between these two intervals i.e. 1.46 ≤ wi < 1.53. Similarly, over-
multiplied by 16. lapping between two intervals for selecting 𝜎 i = 1 and 𝜎 i = 0 is
0.46 ≤ wi < 0.53. Using these obtained results, the range of wi for
128 128
− P [1] ≤ 16(wn−1 − Pn−1 [q]) ≤ P [1] (6) selecting 𝜎 i = 1 can be set i.e. 𝜎 i = 1 for 0.5 ≤ wi < 1.5.
15 n−1 15 n−1
As the selection criteria decides the size of the comparator, all values
Using the definition of variable wi given in (2), we obtain are chosen in such a way that a minimum number of bits are required
128 to represent all values of selection criteria in binary. All the values of
|wn | ≤ P [1] (7) selection criteria for 𝜎 0 can be expressed in binary using 5-bit only, and
15 n−1
128
hence 5-bit comparator is required for its implementation. The detail
For all i ≥ 1, it is verified that inequality P [1]
15 n−1
≤ Pn [8] + of this implementation is discussed in the section on the architecture of
(8∕15)Pn [1] is valid. the proposed radix-16 CORDIC algorithm.
|wn | ≤ Pn [8] + (8∕15)Pn [1] (8)
2.3. Scale factor compensation
Hence theorem is also verified for i ≥ 1.
For radix higher than two, the scale factor is variable, and it depends
2.2. Selection function upon the value of 𝜎 i through the input angle z0 and iteration being
executed. Scale factor compensation can be a considerable concern for
Multiplication of 𝜎 i with xi cannot be achieved by shift operation the designers as it is not constant. After computing the total rotation,
whenever the selection function is not a two to the power of n, where the CORDIC algorithm scales up the final value of coordinates of the
n is an integer. To handle such a situation, two partial products are rotated vector by scale factor Ki𝜎i given by
generated.
One additional 4-to-1 multiplexer is added in data-path to gener- Ki−𝜎1 = (1 + 𝜎i2 16−2i )−1∕2 = (1 + 𝜎i2 2−8i )−1∕2 (9)
i
ate the second partial product. The detail on partial products is given
To find out the true values of vector, the rotated vector has to be mul-
in Table 1. Since both the multiplexers compute the partial products n
tiplied by Ki−𝜎1 . The scale factor is determined for 16 + 1 iteration and
simultaneously, the added multiplexer does not contribute to the crit- i
n
ical delay of data-path. Later generated partial products and rotating for iteration i > 16
+ 1, the scale factor is assumed to be one, as in each
vector are added using 3-to-2 CSA (carry-save arithmetic). As shown in iteration the value of 𝜎i2 is shifted right by 8-bit positions. In this imple-
Table 1, 5-to-1 and 4-to-1 multiplexers are required to generate first and mentation, 16-bit fixed-point precision is taken to represent the value
second partial products, respectively. The detail on the implementation of the scale factor. For all possible values of 𝜎 i , the scaled factor is pre-
of this scheme is given in the section on architecture. calculated and stored in the look-up table. Later, the rotated vector is
The criteria for the selection function is given in (4). To obtain multiplied by the stored scale factor using shift and add operations. The
the limits of selection intervals, the overlapping area between intervals scale factor is broken into 4n number of nibbles, and hence for 16-bit
given in (4) for all the values of selection function must be obtained. precision, four nibbles have to be considered. To multiply the rotated
91
A. Changela et al. Integration, the VLSI Journal 73 (2020) 89–100
Table 2
Selection criteria for i = 0, i = 1, and i > 1.
Selection function i = 0 i = 1 i > 1
8 w0 ≥ 1.5 w1 ≥ 7 wi ≥ 7.5
7 1.5 > w0 ≥ 1.375 7 > w1 ≥ 6.125 7.5 > wi ≥ 6.5
6 1.375 > w0 ≥ 1.25 6.125 > w1 ≥ 5.25 6.5 > wi ≥ 5.5
5 1.25 > w0 ≥ 1.125 5.25 > w1 ≥ 4.375 5.5 > wi ≥ 4.5
4 1.125 > w0 ≥ 1 4.375 > w1 ≥ 3.5 4.5 > wi ≥ 3.5
3 1 > w0 ≥ 0.875 3.5 > w1 ≥ 2.5 3.5 > wi ≥ 2.5
2 0.875 > w0 ≥ 0.75 2.5 > w1 ≥ 1.5 2.5 > wi ≥ 1.5
1 0.75 > w0 ≥ 0.375 1.5 > w1 ≥ 0.5 1.5 > wi ≥ 0.5
0 0.375 > w0 ≥ −0.375 0.5 > w1 ≥ −0.5 0.5 > wi ≥ −0.5
−1 −0.375 > w0 ≥ −0.75 −0.5 > w1 ≥ −1.5 −0.5 > wi ≥ −1.5
−2 −0.75 > w0 ≥ −0.875 −1.5 > w1 ≥ −2.5 −1.5 > wi ≥ −2.5
−3 −0.875 > w0 ≥ −1 −2.5 > w1 ≥ −3.5 −2.5 > wi ≥ −3.5
−4 −1 > w0 ≥ −1.125 −3.5 > w1 ≥ −4.375 −3.5 > wi ≥ −4.5
−5 −1.125 > w0 ≥ −1.25 −4.375 > w1 ≥ −5.25 −4.5 > wi ≥ −5.5
−6 −1.25 > w0 ≥ −1.375 −5.25 > w1 ≥ −6.125 −5.5 > wi ≥ −6.5
−7 −1.375 > w0 ≥ −1.5 −6.125 > w1 ≥ −7 −6.5 > wi ≥ −7.5
−8 −1.5 > w0 −7 > w1 −7.5 > wi
92
A. Changela et al. Integration, the VLSI Journal 73 (2020) 89–100
below,
93
A. Changela et al. Integration, the VLSI Journal 73 (2020) 89–100
pensation. High precision of scale factor may require more shifters, hybrid CORDIC algorithm. The implementation of the proposed radix-
ROM to store scale factor and full adders in the compensation stage. 16 CORDIC algorithm requires 128 full adders for 16-bit precision as
High precision of X and Y vectors may require more iterations to com- compared to 256 full adders of the radix-2 CORDIC algorithm. The pro-
pute the total rotation. For example, for 20-bit precision of scale fac- posed radix-16 CORDIC algorithm has overhead in terms of a 5-to-1
tor, a 25-bit CSD coded scale factor has to be stored on ROM, and multiplexer and 5-bit comparator. For proposed radix-16 CORDIC algo-
five shifters are required to generate ten partial products. Later, 10- rithm, 9N4
× N ROM is required to store precomputed tan−1 (). The size
to-2 CSA can be used to generate sum and carry vectors by adding ten of this ROM is 36 × 16 bits for N = 16 bits, and it can be imple-
partial products. In general, for n-bit precision, total 4n + 2 iterations mented using nine slices only as one slice can store 64 bits of data. Sim-
are required to compute total rotation and to compensate for the scale ilarly, twenty-five slices are required to store the precomputed scale
factor. factor. The shift-add operation required to multiply scale factor with
The proposed radix-16 CORDIC algorithm is compared with other coordinates of the rotated vector is similar for all implementation. As a
known CORDIC algorithms. The following parameters are considered result, the implementation of the proposed radix-16 CORDIC algorithm
for comparison. 1. Iterations needed to measure the complete rotation, has less hardware compared to radix-2 CORDIC and recently proposed
2. The area in terms of full adders, 3. Critical path delay of iteration CORDIC algorithms.
and 4. Overhead in terms of additional multiplexers and comparators
required for implementation 5. ROM size required to store precom-
3. The proposed architecture of radix-2 DIF FFT algorithm
puted tan−1 () and 6. Type of scale factor (i.e., constant or variable)
and resource required to store and compensate for the scale factor.
In the case of a real-time application, which processes the contin-
The detailed comparison is shown in Table 4. Total full adders needed
uous flow of data, pipelined architecture is a better fit for such an
to implement one rotator (i.e., X or Y rotator) are given in the third
application. In proposed implementation, we have chosen the pipelined
column. For simplicity, delay due to overhead is not considered in a
architecture to increase the throughput and radix-16 CORDIC based
critical path as it would be minimal compared to the delay of carry
rotator is used to reduce the hardware resources.
propagation adder (tCPA ). As shown in Table 4, hybrid CORDIC algo-
rithm presented in Ref. [25] and proposed radix-16 CORDIC algorithm
requires a similar number of iteration to compute the total rotation but 3.1. Butterfly processing unit
proposed radix-16 CORDIC algorithm has less critical data path and it
requires 2N full adders for one rotator as compared to 3N full adders of For N-point FFT computation, log2 N stages are required. In
pipelined architecture, each stage has a butterfly unit to perform a but-
94
A. Changela et al. Integration, the VLSI Journal 73 (2020) 89–100
Table 4
Comparison of proposed radix-16 CORDIC algorithm with other proposed CORDIC algorithms.
CORDIC Parameters ROM size to Scale factor
Algorithm store tan−1 () compensation
No. of FAs∕ Total Overhead Critical Path FAs for 16-
Iterations rotator FAs delay bit precision
Radix-2 N N N2 No overhead tCPA 256 N × N Constant, shift-add
CORDIC [20]
N N2 3N
Radix-4 2
N 2
3_1 Mux + tCPA 128 2
×N Variable, ROM
N +1
CORDIC [23] 4_bit Comp. 3 4 × N, shift-add
N +3 3N (N +3)
Double Step 2
3N 2
Not reported 2tCPA 456 (N + 4) × N Constant, shift-add
Branching
CORDIC [24]
N 3 2 6N
Hybrid 4
3N 4
N 2∗ (3_1Mux + 2tCPA 192 4
×N Variable, ROM
N
CORDIC [25] 4_bitComp.) 9 16 +1 × N, shift-add
N 2N 2 5N
Radix-8 3
2N 3
4_1 Mux + tCPA + tFA 170 3
×N Variable, ROM
N +1
CORDIC [21] 5_bit Comp. 58 × N, shift-add
N N2 9N
Radix-16 4
2N 2
5_1 Mux + tCPA + tFA 128 4
×N Variable, ROM
N
CORDIC 5_bit Comp. 9 16 +1 × 5N ,
4
shift-add
95
A. Changela et al. Integration, the VLSI Journal 73 (2020) 89–100
Fig. 7. Proposed architecture of Double-path Delay Commutator (DDC) R2DIF FFT algorithm.
96
A. Changela et al. Integration, the VLSI Journal 73 (2020) 89–100
Table 5
Comparison of the proposed architecture to other architectures for the computation of an N-point FFT in terms of required arithmetic and
delay resources, and latency.
FFT Algorithm Arithmetic Resources (n = log2 N) Total Adder for Delay Resources Latency
16-bit accuracy
Complex Complex CORDIC Total Adder for
Multiplier Adder Rotator B-bit accuracy
N
R2MDC [11] n − 1 2n 0 3B(n − 1) + 8n − 4 56n − 52 N 2
3N 11N
R2MDC [14] n − 2 2n 0 3B(n − 2) + 8n − 8 56n − 104 2
−2 8
−4
R2M2 DF [13] n − 2 2n 0 3B(n − 2) + 8n − 8 56n − 104 N − 2 N
N
R2DDC [12] 0 2n n − 2 3B(n − 2) + 4n 56n − 48 N 2
+n−2
7B N −16
Proposed, R2M• DC 0 2n n − 2 4
(n − 2) + 4n 32n − 56 N − 6n + 20 2
+n+ 13
Table 6
Comparison of resource utilization of proposed radix-16 CORDIC based rotator
with Xilinx IP core of radix-2 CORDIC rotator and complex multiplier.
Design Hardware Resource Utilization Latency
Slice Slice Total
LUTs registers Slices
Xilinx ComplexMultiplier IP V.6 1019 940 1959 5
Xilinx CORDICIP V.6 1146 1201 2347 16
Proposed radix-16CORDIC based rotator 954 336 1290 7
Table 7
Performance comparison of the proposed CORDIC rotator based R2DIF FFT
algorithm and conventional memory-based architectures.
FFT Points Conventional design Proposed design % Improvement in Performance
Slice Latency Slice Latency Slice Latency
512 14,844 520 11,302 270 23.86 48.08
1024 20,195 1033 13,449 527 33.4 48.98
2048 24,827 2058 16,273 1040 34.45 49.47
4096 34,383 4107 20,615 2065 40.04 49.72
are implemented for 16-bit fixed-point precision for real and imaginary Table 8
value, and it is synthesized for Virtex-7 series FPGA for the compar- SNR comparison of the proposed design and Matlab implementation.
ison purpose. Table 6 shows the resource utilization of our proposed
FFT Point SNR in dB PSNR in dB
radix-16 CORDIC based rotator and Xilinx IPs of complex multiplier
Proposed Implementation Matlab Implementation
and CORDIC based rotator. Table 6 suggests that our proposed radix-16
CORDIC based rotator has 34% less hardware utilization as compared to 512 40.18 41.22 62.65
1024 45.16 49.57 60.34
Xilinx IPs of complex multiplier [31] and 45% as compared to CORDIC 2048 37.45 41.09 59.06
rotator [35]. Comparison between the proposed design and conven- 4096 44.92 49.52 57.53
tional fully pipelined DDC implementation [12] of variable length FFT
algorithm is presented in Table 7 considering two parameters, resource
utilization in terms of slices used and latency in terms of clock cycles
of our proposed architecture is measured and compared with the results
used to compute the FFT of an input sequence. The conventional fully
obtained using the Matlab function fft in double-precision floating-point
pipelined DDC architecture presented in Ref. [12], has used LUTs to
format. First, the input signal, xt = 0.5 × (sin(2𝜋 f1 t) + sin(2𝜋 f2 t)) is
store the complex twiddles. It has used the complex multiplier, imple-
sampled with sampling rate fs where, f1 = 0.2 MHz, f2 = 0.3 MHz,
mented using CLBs, to carry out the twiddle multiplications. Table 7
and fs = 1 MHz. Later, the sampled analog input is converted to digital
suggests that our proposed implementation uses an average of 32.96%
in FXP(16,14) fixed-point representation and applied to our proposed
fewer slices as compared to conventional pipelined DDC architecture.
architecture. The output of the proposed architecture is brought back
Reduction in slices is possible as our proposed radix-16 CORDIC based
to Matlab to measure the SNR. The SNR is defined as the ratio of the
rotator has less hardware utilization as compared to the complex multi-
power of the input signal to the sum of the power of all frequency com-
plier, as shown in Table 6. The number of stages in FFT design depends
ponents (noise) in the output which are not the part of the input signal.
on the number of points of the FFT, and each stage has one rotator
The peak signal-to-noise ratio (PSNR) is also measured considering the
to perform the complex multiplication. Hence, if the size of the FFT
Matlab output as reference. The peak signal-to-noise ratio (PSNR) is
increases, percentage saving in the slice increases, which can be seen
defined as the ratio of a maximum power of input signal to mean of the
in Table 7. As shown in Table 7, the proposed implementation has 49%
sum over all squared differences.
less computational latency compared to conventional pipelined DDC
( )
architecture as our proposed design processes two data samples in one A2max
clock cycle. For accuracy measurement, the signal-to-noise ratio (SNR) PSNR = 10 log 1 ∑N (13)
2
N i=1 (Ai − Bi )
97
A. Changela et al. Integration, the VLSI Journal 73 (2020) 89–100
where Amax is the maximum amplitude of the input, Ai is the output are also added in Table 9. As the order of the input and output sam-
of the proposed architecture and Bi is the output of the Matlab func- ples depend on different applications, hardware required to order the
tion fft. Table 8 shows the comparison of the proposed architecture data before the FFT and after the FFT is not considered for compar-
with the Matlab implementation for different length of the FFT. For ison purposes. Many architectures use the dedicated functional block
4096 point FFT, the loss in SNR is 4.6 dB and 1.04 dB for 512 points of the FPGA, i.e., DSP and block RAM. For comparison purposes, these
FFT due to quantization and truncation error. Fig. 9 shows the output functional blocks are converted into their equivalent slices. One block
of the proposed architecture and Matlab implementation. The PSNR RAM is approximated to the 550 slices as one slice can be configured to
of the proposed design is given in Table 8.The detailed comparison store 64-bits of data. One DSP block can be approximated to 385 slices
between our proposed pipelined architecture and other pipelined archi- as the height of the DSP block matches the one block RAM, and each
tectures is presented in Table 9 considering these parameters: input DSP block aligns horizontally with an 18K block RAM. Architectures,
data width, resource utilization, latency, throughput, and operating presented in Refs. [9,13–15,34], use the complex multiplier to imple-
frequency. The length of the FFT is considered for the classification ment the butterfly processing unit. The butterfly processing unit of the
of the FFT architecture. The resources used to implement the butter- architecture presented in Refs. [4,12] is implemented using a radix-2
fly operation and architecture used to implement the FFT algorithm CORDIC algorithm based rotator. The butterfly processing unit of our
Table 9
Performance comparison of the proposed R2DIF FFT implementation with other implementation.
R2: Radix-2 FFT, FP: Fully pipelined, PP: Parallel Pipelined, CA: Complex Adder, CM: Complex Multiplier, a : Virtex-2, b : Virtex-6.
c
: Virtex-7, d : Virtex-5, FF:Feed Forward, MDC: Multipath Delay Commutator, M2 DF: Mixed-decimation multipath Delay Feedback.
MDC• :Proposed optimized MDC architecture to compensate the latency of R16CORDIC, PSI: Pipelined Streaming I/Os.
SDC-SDF: Single-path Delay Commutator-Feedback, DDC: Double-path Delay Commutator, # : Optimized radix-2 CORDIC.
98
A. Changela et al. Integration, the VLSI Journal 73 (2020) 89–100
Fig. 10. The comparison between our proposed FFT implementation with radix- The authors declare that they have no known competing financial
2 CORDIC based FFT implementation and Xilinx FFT IP core. interests or personal relationships that could have appeared to influence
the work reported in this paper.
99
A. Changela et al. Integration, the VLSI Journal 73 (2020) 89–100
[21] A. Changela, M. Zaveri, A. Lakhlani, ASIC implementation of high performance [28] M.D. Ercegovac, J. Muller, Digit-recurrence algorithms for division and square root
radix-8 CORDIC algorithm, in: 2018 International Conference on Advances in with limited precision primitives, in: The Thrity-Seventh Asilomar Conference on
Computing, Communications and Informatics (ICACCI), 2018, pp. 699–705, Signals, Systems Computers, 2003, vol. 2, 2003, pp. 1440–1444, https://doi.org/
https://doi.org/10.1109/ICACCI.2018.8554883. 10.1109/ACSSC.2003.1292224.
[22] A. Changela, M. Zaveri, A. Lakhlani, FPGA implementation of asynchronous [29] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford
mousetrap pipelined radix-2 CORDIC algorithm, in: 2018 International Conference University Press, Oxford, UK, 2000.
on Current Trends towards Converging Technologies (ICCTCT), 2018, pp. [30] S.-M. Kim, J.-G. Chung, K.K. Parhi, Low error fixed-width CSD multiplier with
252–258, https://doi.org/10.1109/ICCTCT.2018.8551112. efficient sign extension, IEEE Trans. Circ. Syst. II: Anal. Dig. Sign. Proc. 50 (12)
[23] E. Antelo, J. Villalba, J.D. Bruguera, E.L. Zapata, High performance rotation (2003) 984–993, https://doi.org/10.1109/TCSII.2003.820231.
architectures based on the radix-4 CORDIC algorithm, IEEE Trans. Comput. 46 (8) [31] I. Xilinx, LogiCORE IP Complex Multiplier v6.0, Product Specifications PG104,
(1997) 855–870, https://doi.org/10.1109/12.609275. 2015. November 18th.
[24] D.S. Phatak, Double step branching CORDIC: a new algorithm for fast sine and [32] J.G. Proakis, D.K. Manolakis, Digital Signal Processing, fourth ed., Prentice-Hall,
cosine generation, IEEE Trans. Comput. 47 (5) (1998) 587–602, https://doi.org/ Inc., USA, 2006.
10.1109/12.677251. [33] C. Ingemarsson, O. Gustafsson, SFF-the single-stream FPGA-optimized feedforward
[25] R. Shukla, K.C. Ray, Low latency hybrid CORDIC algorithm, IEEE Trans. Comput. FFT hardware architecture, J. Sign. Process. Syst. 90 (11) (2018) 1583–1592,
63 (12) (2014) 3066–3078, https://doi.org/10.1109/TC.2013.173. https://doi.org/10.1007/s11265-018-1370-y.
[26] Y.-N. Chang, An efficient VLSI architecture for normal i/o order pipeline FFT [34] I. Xilinx, LogiCORE IP Fast Fourier Transform v8.0, Product Specifications DS808,
design, IEEE Trans. Circ. Syst. II: Express Briefs 55 (2008) 1234–1238. 2012. July 25th.
[27] P.K. Meher, J. Valls, T.B. Juang, K. Sridharan, K. Maharatna, 50 years of CORDIC: [35] I. Xilinx, LogiCORE IP CORDIC v6.0, Product Specifications IPPG105, 2017.
algorithms, architectures, and applications, IEEE Trans. Circ. Syst. I: Reg. Pap. 56 December 20th.
(9) (2009) 1893–1907, https://doi.org/10.1109/TCSI.2009.2025803.
100