A Memory-Based FFT Processor Design With Generalized Efficient Conflict-Free Address Schemes


IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 6, JUNE 2017

Kai-Feng Xia, Bin Wu, Tao Xiong, and Tian-Chun Ye

Abstract— This paper presents the design and implementation of memory-based fast Fourier transform (FFT) processors with generalized efficient conflict-free address schemes. We unified the conflict-free address schemes of three different FFT lengths, including the single-power points, the common nonsingle-power points, and the nonsingle-power points applied with a prime factor algorithm. Though the three cases differ in terms of decomposition, they are all compatible with memory-based architecture by way of the proposed address schemes. Moreover, the decomposition algorithm utilizes a method, named high-radix–small-butterfly (HRSB), to decrease the computation cycles and eliminate the complexity of the processing engine. In addition, an efficient index generator, a simplified multipath delay commutator engine, and a unified Winograd Fourier transform algorithm butterfly core were also designed. We designed two FFT examples in the long-term evolution system to verify the availability of the address scheme, including a 2^n (128–2048)-point FFT unit and a 35-different-point (12–1296) DFT unit. Compared with previous works with similar address schemes, this paper supports more generalized lengths and achieves more flexible throughput.

Index Terms— Conflict-free address scheme, long-term evolution (LTE), memory-based fast Fourier transform (FFT) processor, prime factor algorithm (PFA), Winograd algorithm.

Manuscript received April 22, 2016; revised October 23, 2016, December 7, 2016, and January 21, 2017; accepted January 31, 2017. Date of publication March 3, 2017; date of current version May 22, 2017. This work was supported by the National Major Science and Technology Program of China under Grant 2014ZX03001011-002. This paper was presented at the IEEE International Symposium on Circuits and Systems Proceedings, Montreal, Canada, 2016.

The authors are with the Institute of Microelectronics, Chinese Academy of Sciences, Beijing 100029, China (e-mail: xiakaifeng@ime.ac.cn; wubin@ime.ac.cn).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2017.2666820

1063-8210 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

THE fast Fourier transform (FFT) is a commonly used algorithm in digital signal processing areas, such as imaging applications and communication systems. Image processing requires computation sizes as high as 2^22 [1], whereas communication systems apply several computation lengths simultaneously, such as the sizes of 128–2048 in 3GPP long-term evolution (LTE) systems [2]. Consequently, long and variable-size FFT applications are becoming popular.

FFT processors can be categorized into two architectures: pipeline- and memory-based architectures. Memory-based FFTs pass the data multiple times through a single butterfly processing element (PE) or set of PEs, with several memory banks to hold intermediate results, processing the data recursively regardless of computation length. Memory-based FFTs achieve better hardware efficiency compared with pipeline ones. Nevertheless, the throughput of memory-based FFTs is usually restricted by the butterfly radix and concurrent data access contentions.

Conflict-free address schemes for concurrent data access from different memory banks thus become an essential problem. A parity bit check method for one or more radix-2 PEs is first introduced in [3] and [4]. An in-place strategy from [5] reduces the total memory storage to the minimum N. In [6], a mixed radix-4/2 in-place scheme makes the input and output bits symmetric, and then the conflict-free scheme is extended to a mixed-radix algorithm. In [7], a multiple radix-2 PE scheme is demonstrated to increase the throughput of FFT processors. However, the methods in [6] and [7] are only suitable for power-of-two-point FFTs. In [8], a single- and multiple-PE method for the arbitrary radix-b algorithm is discussed. However, it requires memory in every stage.

In [9]–[11], a multipath delay commutator (MDC) architecture with high radix is used to replace the complex PE in conventional memory-based FFTs. This method provides low-power dissipation, high data rates, good computational efficiency, and refined length flexibility. However, none of them provides a detailed conflict-free address scheme. In [12], a generalized conflict-free address scheme with single or multiple radix-2^q MDC architectures for 2^n-point FFTs is illustrated. However, this scheme does not extend the principle to arbitrary-length FFTs and more general decomposition algorithms. Generalized mixed-radix algorithms that support both traditional 2^n-point and prime-sized FFTs are proposed in [13] and [14]. Although [13] and [14] obtain the memory bank and address rules according to the decomposition algorithms, they do not state the data distribution procedure clearly, especially when the prime factor algorithm (PFA) is applied within the FFT. Moreover, they cannot fully support the continuous-flow working mode.

This paper presents a memory-based FFT processor design methodology with a generalized conflict-free address scheme for arbitrary-length FFTs. We unify the conflict-free address schemes of three different FFT lengths, including the single-power point (SPP) FFTs, the common nonsingle-power point (NSPP) FFTs, and the NSPP FFTs applied with the PFA, to the same address generation format. The memory bank index and the internal address are all generated by

modulo and multiplication operations of the decomposition digits. Moreover, a decomposition algorithm, named high-radix–small-butterfly (HRSB), utilizes a high-radix PE to reduce the total number of computation cycles and small butterfly units (BUs) to eliminate the complexity of the computation engine. Thus, the FFTs can complete N-point computations within N cycles to support the continuous-flow mode with low complexity. In the previous methods [13], [14], two successive identical symbols produce two slightly different results, because the factorization changes after each transform. To avoid this problem, we apply the same factorization to every data symbol. Furthermore, an efficient index generator, a simplified configurable MDC unit, and a unified Winograd Fourier transform algorithm (WFTA) butterfly core for point-2, 3, 4, 5 DFTs are designed for the prime-factor FFTs. We designed two FFT examples in LTE systems, including a 2^n-point (128–2048) FFT unit and a DFT unit with 35 different points (12–1296). The techniques proposed previously can be extended to arbitrary-length FFTs.

The remainder of this paper is organized as follows. Section II reviews the background information related to this paper. In Section III, the address schemes for arbitrary-length SPP FFTs, NSPP FFTs, and NSPP FFTs with the PFA are illustrated, respectively. In Section IV, the details of the FFT architectures for LTE systems are presented. The implementation results and the performance comparisons are analyzed in Section V. The conclusion is given in Section VI.

II. BACKGROUND

A. Methods to Support Continuous-Flow Memory-Based FFTs

Memory-based FFTs usually include three stages: the input sampling, intermediate computations, and the output reordering. The continuous-flow mode is available only if the computation time is not longer than the input/output time. Fig. 1 is a typical parallel-PE memory-based FFT adopting the in-place strategy, which consists of L parallel radix-r PEs and two P-bank memory groups. Consequently, the continuous-flow memory-based FFT should satisfy

  N/(r·L) · ⌈log_r N⌉ ≤ N        (1)

where ⌈log_r N⌉ is the number of recursive computation stages. The total number of parallel paths of the PEs is P = r·L. As a result, increasing the radix r or the PE parallelism L represents the two basic methods of satisfying (1).

Fig. 1. Architecture of a multiple-PE, memory-based FFT processor [12].

B. Index Mapping Methods of the PFA and CFA

The definition of the discrete Fourier transform (DFT) is given by C(k) = Σ_{n=0}^{N−1} x(n)·W_N^{nk}. In [15] and [16], the method for decomposing one-dimensional DFTs into multidimensional DFTs is demonstrated. If the length N is decomposed into N1·N2, the index mapping can be expressed as follows:

  n = ⟨K1·n1 + K2·n2⟩_N,  n1, k1 = 0, 1, ..., N1 − 1
  k = ⟨K3·k1 + K4·k2⟩_N,  n2, k2 = 0, 1, ..., N2 − 1        (2)

where ⟨·⟩_N denotes the modulo-N operation. If N1 and N2 are relatively prime, a Ruritanian map for the input n and a Chinese remainder theorem (CRT) map for the output k are used. Therefore

  K1 = N2;  K3 = N2·⟨N2^{−1}⟩, so that ⟨N2·⟨N2^{−1}⟩⟩_{N1} = 1
  K2 = N1;  K4 = N1·⟨N1^{−1}⟩, so that ⟨N1·⟨N1^{−1}⟩⟩_{N2} = 1.        (3)

Then, the definition of the DFT can be

  C(k1, k2) = Σ_{n=0}^{N−1} x(n1, n2)·W_N^{N1·N1·⟨N1^{−1}⟩·n2·k2}·W_N^{N2·N2·⟨N2^{−1}⟩·n1·k1}
            = Σ_{n=0}^{N−1} x(n1, n2)·W_{N2}^{N1·⟨N1^{−1}⟩·n2·k2}·W_{N1}^{N2·⟨N2^{−1}⟩·n1·k1}
            = Σ_{n2=0}^{N2−1} { Σ_{n1=0}^{N1−1} x(n1, n2)·W_{N1}^{n1·k1} }·W_{N2}^{n2·k2}.        (4)

This is one of the index mapping methods of the PFA. On the other hand, if N1 and N2 are not relatively prime, we can obtain

  K1 = N2,  K2 = 1,  K3 = 1,  K4 = N1.        (5)

Consequently, the DFT is given as

  C(k1, k2) = Σ_{n=0}^{N−1} x(n1, n2)·W_N^{N1·n2·k2}·W_N^{N2·n1·k1}·W_N^{n2·k1}
            = Σ_{n2=0}^{N2−1} { Σ_{n1=0}^{N1−1} x(n1, n2)·W_{N1}^{n1·k1}·W_N^{n2·k1} }·W_{N2}^{n2·k2}.        (6)

This is the famous Cooley–Tukey algorithm, which is also called the common-factor algorithm (CFA). Both the PFA and the CFA decompose the N-point DFT into N2 DFTs with N1 points and N1 DFTs with N2 points successively. They both utilize an input index map for computation and an output reorder map for natural outputting. The mapping method of the PFA is more complicated than that of the CFA. However, there are twiddle factors W_N^{n2·k1} between the two-level DFTs for the CFA but not for the PFA. The mapping methods of (3) and (5) can both be extended to dimensions greater than two.
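As a numerical sanity check of the PFA mapping in (2)–(4), the following Python sketch (illustrative only, not part of the paper's hardware; function names are our own) applies the Ruritanian input map and the CRT output map to a 12-point DFT and compares the result with a direct DFT. No twiddle factors appear between the two levels, as (4) predicts.

```python
import cmath
from math import gcd

def W(N, e):
    # Twiddle factor W_N^e = exp(-j*2*pi*e/N)
    return cmath.exp(-2j * cmath.pi * e / N)

def dft(x):
    # Direct O(N^2) DFT as the reference
    N = len(x)
    return [sum(x[n] * W(N, n * k) for n in range(N)) for k in range(N)]

def pfa_dft(x, N1, N2):
    # PFA for N = N1*N2 with gcd(N1, N2) = 1:
    # Ruritanian input map  n = <N2*n1 + N1*n2>_N  and
    # CRT output map        k = <K3*k1 + K4*k2>_N, per (2) and (3).
    N = N1 * N2
    assert gcd(N1, N2) == 1
    K3 = N2 * pow(N2, -1, N1)   # N2*<N2^-1>, so <K3>_{N1} = 1
    K4 = N1 * pow(N1, -1, N2)   # N1*<N1^-1>, so <K4>_{N2} = 1
    X = [0j] * N
    for k1 in range(N1):
        for k2 in range(N2):
            acc = 0j
            for n1 in range(N1):
                for n2 in range(N2):
                    # two nested short DFTs, no inter-level twiddles
                    acc += (x[(N2 * n1 + N1 * n2) % N]
                            * W(N1, n1 * k1) * W(N2, n2 * k2))
            X[(K3 * k1 + K4 * k2) % N] = acc
    return X

x = [complex(i, -i) for i in range(12)]
ref = dft(x)
out = pfa_dft(x, 3, 4)
assert all(abs(a - b) < 1e-9 for a, b in zip(ref, out))
```

The cross terms of n·k contain the factor N1·N2 = N and vanish modulo N, which is exactly why the second line of (4) needs no W_N twiddle between the two DFT levels.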

C. Conflict-Free Address Method Proposed by Tsai and Lin

A generalized conflict-free memory address scheme for memory-based FFTs with parallel arithmetic processing units is presented in [12]. Assume that the N = 2^m FFT employs a mixed-radix algorithm with maximum radix-2^q MDC units. The FFT is computed in S stages in total, where S = ⌈m/q⌉ and ⌈·⌉ is the ceiling function. At stage s, the adopted radix is 2^{q_s}, with q_s ≤ q, and the number of radix-2^{q_s} BUs at the current stage is 2^{m−q_s}. b_s^I and b_s^O denote the different bits of the upper and lower binary indices for the input and output data of the radix-2^{q_s} MDC at stage s. Consequently, U_s^M = {b_s^I, b_s^O} represents a set containing the joint constraints in stage s for nonconflict data access. Since the in-place strategy is applied, the address representation of the even symbol for the same memory is in the reversed binary format of the odd symbol. Thus, the bit constraints for the forward and reversed address representations, U_s^F and U_s^R, in one stage should be considered concurrently. EXCLUSIVE OR (XOR) operations of all the decimation constraints from stage 0 to S − 1 can be used to obtain the memory bank index.

For simplicity, we use a 128-point FFT to illustrate. If we apply the radix-2^3 algorithm, the 128-point FFT can be factorized into three stages: 2 × 2^3 × 2^3. The forward and reversed constraints in each stage can be given by

  U_0^F = {b[6]},        U_1^F = {b[5], b[3]}
  U_2^F = {b[2], b[0]},  U_0^R = {b[0]}
  U_1^R = {b[1], b[3]},  U_2^R = {b[4], b[6]}.        (7)

There are three disjoint sets of constraints, i.e., G_0 = U_0^F ∪ U_2^R, G_1 = U_0^R ∪ U_2^F, and G_2 = U_1^F ∪ U_1^R. If there is only one MDC PE, the bank mapping function B is the combined XOR operation of the three constraint sets

  B(b) = XOR{G_0, G_1, G_2}.        (8)

The physical address in the internal memory bank for each index can be derived by eliminating one bit, in the constraint sets, of the binary index. One of the physical address assignments can be A(b) = Σ_{i=1}^{6} b[i]·2^i.

However, for a computation engine with P parallel MDC PEs, if the constraint that distinguishes the upper and lower paths of a radix-2 butterfly is in one set, the constraints that differentiate the memory banks for different MDC PEs must be in other sets in any stage s. As a result, the 128-point FFT can include at most eight parallel memory banks and four radix-2^3 MDC PEs. Consequently, the bank mapping function B_P(b) for the eight banks is

  B_P[d] = XOR(G_d),  d = 0, 1, 2
  B_P(b) = Σ_{d=0}^{2} B_P[d]·2^d.        (9)

Correspondingly, the physical addresses in each memory bank can be obtained by dropping p + 1 bits of the binary index, one from each constraint set. One format of the addresses can be A_P(b) = Σ_{i=5}^{6} b[i]·2^i + Σ_{j=2}^{3} b[j]·2^j.

III. CONTINUOUS-FLOW CONFLICT-FREE ADDRESS SCHEME FOR DIFFERENT COMPUTATION LENGTHS

In this section, we will propose the continuous-flow conflict-free address schemes for different FFT computation lengths, including SPP FFTs, common NSPP FFTs, and NSPP FFTs applied with the PFA. Although these three cases apply different decomposition algorithms, their conflict-free address methods can be unified by a single scheme. The address schemes for the three cases build progressively from one to the next.

A. Address Scheme for Single-Power Point FFTs

The conflict-free address scheme for SPP FFTs presented by Tsai and Lin [12] is illustrated only for 2^n-point FFTs. We can extend the method to arbitrary-length single-power r^n-point FFTs by replacing the XOR operations in (8) and (9) with r-modulo additions. Assume that the N = r^m-point FFT employs a mixed radix with maximum radix-r^q MDC units. The parallelism of the MDC PE is P = r^p, and P must divide r^{m−q}, the number of butterflies at each stage. As a result, formula (1) can be transformed into

  N/(r·r^p) · ⌈m/q⌉ ≤ N  →  ⌈m/q⌉ ≤ r^{p+1}.        (10)

However, as the throughput of SPP FFTs is closely associated with the PE number, it is critical to explore the relationship between the number of constraint sets and the applied algorithm under different cases.

1) Only Increasing the PE Parallelism or the Butterfly Radix: If we only increase the parallelism of the PE, P, while keeping the butterfly radix as the simplest radix-r, then q = 1, and (10) will become m ≤ r^{p+1}. For any stage s, the r parallel data indices are only different in the (m − s − 1)th digit. For the m-level forward address representation, there are m different constraint sets and vice versa

  U_s^F = d[m − s − 1],  U_s^R = d[s].        (11)

As a result, there are in total m disjoint constraint sets for the radix-r FFT. Since there are r^p different radix-r BUs, at least p + 1 separate constraint sets are required, p for the MDC units and one for the r paths in one MDC, to distribute the data to different memory banks. Because the number of separated constraint sets in this case is m, p + 1 ≤ m must be satisfied. Therefore, the parallelism p must satisfy the following restriction:

  log_r m ≤ p + 1 ≤ m.        (12)

If we only increase the radix of the BU and keep the PE parallelism at one, then (10) will be ⌈m/q⌉ ≤ r. Because there is only one PE, the number of constraint sets needed to distribute the data to different memory banks is simply one. Regardless of the value of m, there will be a q to satisfy the above two conditions. For 2^22-point FFTs in [1], if radix-2 is employed, the parallelism p must range from 4 to 21 to support the continuous-flow mode, i.e., at least 16 parallel radix-2 PEs must be applied. However, if there is only one BU, the butterfly radix must be at least radix-2^12.
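The bank mapping of (7)–(9) can be checked in software. The Python sketch below (illustrative names; it models only the addressing, not the datapath) builds the three XOR constraint groups of the 128-point, 2 × 2^3 × 2^3 example and verifies two properties: flipping any single index bit changes the bank, so two data items that differ in one forward or reversed constraint bit can never collide, and all eight banks are equally loaded.

```python
from collections import Counter

# Disjoint constraint sets of (7): G0 = {b6, b4}, G1 = {b0, b2},
# G2 = {b5, b3, b1}; bank bit d is the XOR of the bits in G_d, per (9).
GROUPS = [(6, 4), (0, 2), (5, 3, 1)]

def bank(n):
    b = 0
    for d, bits in enumerate(GROUPS):
        parity = 0
        for pos in bits:
            parity ^= (n >> pos) & 1   # XOR of the constrained index bits
        b |= parity << d
    return b

# Every bit position belongs to exactly one group, so flipping any
# single index bit flips exactly one bank bit and changes the bank.
for n in range(128):
    for pos in range(7):
        assert bank(n) != bank(n ^ (1 << pos))

# The mapping is linear over GF(2) and surjective, hence balanced:
# each of the 8 banks holds 128/8 = 16 words.
counts = Counter(bank(n) for n in range(128))
assert all(counts[b] == 16 for b in range(8))
```

The extension proposed in Section III-A simply replaces the per-group XOR (2-modulo sum) by an r-modulo sum of radix-r digits.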
However, the hardware consumption of the butterfly engine will be huge if the PE parallelism or the butterfly radix becomes excessive.

2) Increasing the PE Parallelism and the Butterfly Radix Simultaneously: The most efficient way to realize (10) is to use the high-radix algorithm as well as the parallel-PE techniques. If there are in total K separate constraint sets for all the stages, there must be p + 1 ≤ K. For a hardware-efficient MDC unit, q will appropriately be 3 or 4. We will use q = 3 for discussion. There are three different m types, including mod(m, 3) = 0, mod(m, 3) = 1, and mod(m, 3) = 2, which have different constraint set numbers.

According to the method in [12], the different digits of the concurrent data for the input and output of the forward address representation of radix-r^3 are d0 and d2 in the digit group (d0, d1, d2). d0 and d2 denote the address constraint in one stage. The radix-r^3 algorithm is utilized in every stage in the mod(m, 3) = 0 type. As shown in Fig. 2(a), the digits with a shadow in a dotted block region represent one separate constraint set. The reversed address representation reverses the digit positions of the forward one. As a result, the forward address constraint representation is the same as that of the reversed, and there are m/3 different sets for the mod(m, 3) = 0 type. For the mod(m, 3) = 1 and mod(m, 3) = 2 types, the first stage is not radix-r^3 but instead radix-r or radix-r^2. Thus, the forward and reversed address constraint representations are not the same, as shown in Fig. 2(b) and (c). There are ⌈m/3⌉ and ⌈m/3⌉ − 1 different constraint groups for mod(m, 3) = 1 and mod(m, 3) = 2, respectively. Therefore, the parallelism of the PE should satisfy

  log_r ⌈m/3⌉ ≤ p + 1 ≤ ⌈m/3⌉,      mod(m, 3) = 0, 1
  log_r ⌈m/3⌉ ≤ p + 1 ≤ ⌈m/3⌉ − 1,  mod(m, 3) = 2.        (13)

Fig. 2. Forward and reversed address constraint representations for different lengths. (a) mod(m, 3) = 0. (b) mod(m, 3) = 1. (c) mod(m, 3) = 2.

For 2^22-point FFTs, the parallelism should satisfy 2 ≤ p ≤ 7 if the radix-2^3 algorithm is employed, i.e., at least four radix-2^3 MDC units will be required.

As a result, the number of constraint sets can be deduced from the specific algorithms and FFT lengths. Because the constraint set number is associated with the PE parallelism, the achievable throughput of the SPP FFT is also decided by the applied algorithms and FFT sizes together.

B. Address Scheme for Nonsingle-Power Point FFTs

For arbitrary NSPP FFTs, a mixed-radix algorithm is usually employed. Assume that the FFT is of size N = N1·N2·N3, with N3 ≥ N1, N3 ≥ N2, n1 = 0, ..., N1 − 1, n2 = 0, ..., N2 − 1, and n3 = 0, ..., N3 − 1. N1, N2, and N3 are common decomposition components. We still apply the Cooley–Tukey algorithm in this decomposition. If the decomposition components are not equal to each other, the forward and reversed address representations are asymmetric. Thus, the in-place strategy will be invalidated, and the address method for SPP FFTs cannot be used directly. However, the r-modulo operation to generate the memory bank index in SPP FFTs can be transplanted into this situation. For the first-level DFT applied with radix-N1, the N1 parallel data are only different in the first digit n1 while keeping the other digits the same. The same holds for parallel data in digits n2 and n3 when computing in the second and third levels with radix-N2 and radix-N3, respectively. Because N3 is the largest factor, the memory bank number must be set to N3 or larger. Otherwise, there will be data contentions at the third level. If there is only one BU, similar to the conflict-free address scheme for SPP FFTs in (8), the bank address of the mixed-radix algorithm can be calculated by the N3-modulo operation of all the digits

  bank = ⟨n1 + n2 + n3⟩_{N3}.        (14)

The physical address in each memory bank is found by deleting the digit n3 in the digital representation [n1, n2, n3]. Thus, the address can be expressed as

  address = n1·N2 + n2.        (15)

The physical bank and address assignment in [13] has the same address format as (14) and (15); however, they are obtained through another method.

The multiple-PE strategy can also be used in the NSPP FFTs. Similar to the SPP FFTs, three independent decimation constraint sets exist for N1·N2·N3. If there are P BUs for the first-stage computation, P must divide the current stage's BU number N2·N3. Because n1 is used to distinguish the parallel data of the radix-N1 BU, the digits to distribute the data to different BUs must be n2 or n3. Take the 24 = 2 × 3 × 4-point FFT for example. The memory bank and address are generated by (14) and (15). If there are two radix-2 PEs for the first stage, the memory bank number can still be 4, and the first binary bit of the radix-4 digit can be used to differentiate the two PEs. However, this rule is only available when the total path number P·N1 is not larger than the bank number N3. Thus, it is hard to extend the multiple-PE bank generation rule in (9) to common NSPP FFTs.

As mentioned in [13], if the in-place strategy is used to minimize the memory storage, a reversed decomposition order of the former symbol should be applied to the latter. If the decomposition of the former symbol is N1 × N2 × N3, the latter must be N3 × N2 × N1 to satisfy the in-place strategy. However, this will lead to two slightly different results in the

fixed-point model, just as if the decomposition algorithms were different. As a result, the same decomposition manner should be applied to all the data symbols, and the input data x[i] cannot be placed in the location of the output data X[i] of the previous symbol. Therefore, an additional memory group will inevitably be used to reorder the output results. Compared with the SPP FFTs with only 2N memory storage in Fig. 1, there must be 3N memory units in the NSPP FFTs: two for the ping-pong switch of the input sampling and one for the output reordering.

In the NSPP FFTs, using a high-radix BU is a more efficient way than the parallel-PE technique to decrease the computation cycles and to increase the throughput, because the former method has less address-controlling complexity than the latter. If the decomposition of a 120-point FFT is 2 × 3 × 4 × 5, then the total computation time of this method is 60 + 40 + 30 + 24 = 154 cycles. We can combine some trivial factors together and make the decompositions compact, such as by using 4 × 5 × 6 or 3 × 5 × 8. Then, the computation time reduces to 30 + 24 + 20 = 74 cycles and 40 + 24 + 15 = 79 cycles, respectively. This mechanism is the same as that in SPP FFTs, only increasing the butterfly radix while keeping the BU number at one. We can apply a high-radix algorithm for the BU and small connected MDC BUs to implement the processing engine. The radix-8 BU can be implemented by a connected radix-2–4 MDC unit, and then the computation time is 40 + 24 + 30 = 94 cycles. This is the concept of the proposed decomposition scheme, named HRSB, that will be introduced in Section IV-B.
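The cycle counts quoted above follow from charging a radix-f pass over N data N/f cycles with a single BU. A small Python sketch (illustrative helper names; the radix-8 HRSB stage is charged N/4 cycles, matching the radix-2–4 MDC figure in the text) reproduces the 154-, 74-, 79-, and 94-cycle totals for the 120-point decompositions:

```python
def cycles(N, factors, cost=None):
    # One radix-f stage over N data with a single BU takes N/f cycles;
    # `cost` optionally overrides the per-stage divisor for a factor
    # (e.g. a radix-8 BU built from a radix-2-4 MDC advances 4 data
    # per cycle, so its divisor is 4 rather than 8).
    cost = cost or {}
    return sum(N // cost.get(f, f) for f in factors)

N = 120
assert cycles(N, [2, 3, 4, 5]) == 154      # 60 + 40 + 30 + 24
assert cycles(N, [4, 5, 6]) == 74          # 30 + 24 + 20
assert cycles(N, [3, 5, 8]) == 79          # 40 + 24 + 15 (ideal radix-8 BU)
assert cycles(N, [3, 5, 8], {8: 4}) == 94  # 40 + 24 + 30 (HRSB radix-2-4 MDC)
```

The comparison shows the HRSB tradeoff directly: 94 cycles is worse than the ideal radix-8 figure of 79, but still well under N = 120 for continuous flow, while the BU hardware stays small.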

C. Address Scheme for Nonsingle-Power Point FFTs Applied With the PFA

When the PFA is employed to reduce the twiddle factor multiplications, the address scheme for NSPP FFTs in Section III-B meets some problems. Because the index mapping of the PFA is quite different from that of the CFA, the data distribution rule is also distinct. Take the 12 = 3 × 4-point FFT as an example. When the CFA is applied, the decomposition factors n1 and n2 of index n can easily be obtained to generate the memory bank and address. The address scheme for the 12-point FFT is

  bank = ⟨n1 + n2⟩_4,  address = n1.        (16)

Fig. 3 is the two-level dataflow graph of the 12-point FFT, in which n is the input number index and (b, a) denotes the corresponding memory bank and address. However, if the PFA is applied and the data are still stored in the same positions as in the CFA, data contentions will appear in stage 1, as shown in Fig. 3(a): some data will originate from the same memory bank. Thus, a new data distribution manner should be explored. The input PFA index mapping of the 12-point FFT is n = ⟨4n1 + 3n2⟩_12. The input index n corresponds to the decomposition components n1 and n2 one by one. Hence, n1 and n2 can be used to generate the memory bank and address as in (16). When the input index n is stored at the position generated by n1 and n2, data conflicts in each stage will be avoided, as shown in Fig. 3(b). In order to obtain n1 and n2 to generate the address scheme when n is known, the inverse mapping rule of n = ⟨4n1 + 3n2⟩_12 must be obtained.

Fig. 3. Dataflow graph of the 12-point DFT combined with the PFA (a) with bank conflict and (b) without bank conflict.

Subsequently, it is essential to explore the common rule of the inverse PFA mapping. For both the PFA and Cooley–Tukey algorithms, the multidimensional input and output mappings can be derived from the two-dimensional mappings in Section II-B. Assume that the DFT length N is factorized into m factors, N = N1·N2···Nm. For the CFA, the input and output index mapping is

  n = Σ_{i=1}^{m−1} n_i·(N_{i+1}···N_m) + n_m
  k = Σ_{i=2}^{m} k_i·(N_1···N_{i−1}) + k_1.        (17)

In (17), n_i, k_i = 0, ..., N_i − 1, and n_i and k_i are the m input and output factors of the decomposition algorithm. For the PFA, the m factors are mutually prime; then the input and output index mapping is

  n = ⟨Σ_{i=1}^{m} n_i·M_i⟩_N,  k = ⟨Σ_{i=1}^{m} k_i·a_i·M_i⟩_N.        (18)

In (18), M_i = N/N_i, and ⟨a_i·M_i⟩_{N_i} = 1 should be satisfied. M_i and a_i·M_i are index mapping parameters for the input and output,

respectively. For simplicity, a_i is usually set to the smallest integer that makes ⟨a_i·M_i⟩_{N_i} = 1 satisfied. When only the CFA is applied, it is easy to obtain the decomposition components n_i by divisions stage by stage, if index n is given. However, for the PFA, it is more complicated. The input and output inverse PFA index mapping rules can be derived as in [16] and [17], and they are expressed as

  n_i = ⟨n·a_i⟩_{N_i},  k_i = ⟨k⟩_{N_i}.        (19)

In (19), the parameter a_i has the same meaning as in (18). When all the decomposition factors n_i are acquired, the general memory bank indices and addresses can be generated as in common NSPP FFTs. Assume that N_m is the largest decomposition component; thus, the memory bank number is set to N_m. The memory bank and address generation rule is

  bank = ⟨Σ_{i=1}^{m} n_i⟩_{N_m}
  address = Σ_{i=1}^{m−1} n_i·(N_{i+1}···N_{m−1}).        (20)

This address generation rule can also be adopted by the output index mapping for reordering by changing the decomposition components n_i to k_i. For the input index 1 in the 12-point FFT, if the CFA is applied, the factors n1 and n2 are 0 and 1, respectively, and the memory bank and address (b, a) generated by (20) is (1, 0). When the PFA is applied, n1 and n2 can be obtained by n1 = ⟨n⟩_3 = 1, n2 = ⟨3n⟩_4 = 3. Consequently, the bank and address (b, a) of index 1 for this case is (0, 1).

However, if there are prime factors and common factors in the decomposition concurrently, both the index mapping rules for the CFA and the PFA should be applied. For N = 2 × 3 × 4 × 5, the factors 3, 4, and 5 are prime with each other, while 2 and 4 are not mutually prime. The PFA index mapping rule cannot be used directly. Thus, we can change the decomposition into N = 3 × 5 × 8, and 8 can further be decomposed into 2 × 4 by the CFA. That is, the PFA is used between the prime factors, while the CFA is applied to the internal nonprime factors. Then, the input and output index mapping is

  n = ⟨40n1 + 24n2 + 15(4n3 + n4)⟩_120
  k = ⟨40k1 + 96k2 + 105(k3 + 2k4)⟩_120.        (21)

For the inverse index mapping of (21), the mutually prime factors n1, n2, and n′3 = 4n3 + n4 should be obtained by (19) first, and then the large prime factor n′3 can be further factorized into the small nonprime factors n3 and n4 by the CFA. All the factorized factors, n1, n2, n3, and n4, can be used to generate the memory bank and address according to (20). Moreover, when a large factor is used in the decomposition, the HRSB scheme can be used to implement the high-radix BU, as stated in Section III-B.

IV. FFT ARCHITECTURES IN LTE SYSTEMS

There are two different FFT modules in the LTE system: one is the 2^n mode (n = 7–11) FFT, and the other is the 35-different-length NSPP DFT, whose transform sizes range from 12 to 1296.

The architectures of the 2^n-point FFT and the NSPP DFT are similar; only some details are different. The entire FFT processor architecture is shown in Fig. 4. It consists of the following main parts: an input and an output index vector generator, a computation address generator, three different memory bank groups, a PE unit applied with the HRSB scheme, some commutators between the memory and the PE, some prestored twiddle factors and FFT size parameters, and an exponent scaling unit. The dashed lines represent the control signals, while the solid lines denote the flowing data.

Fig. 4. Block diagram of the proposed FFT processor architecture.

The input index vector generator distributes the input data to different memory banks without data conflicts, and the output one reorders the output data to a natural sequence. The computation address generator obtains all the concurrent data of each cycle and stores back the intermediate results. Memory groups 1 and 2 are in a ping-pong mode to hold two continuous data symbols in input sampling, and memory group 3 is used to output the computed data in the right order. Three memory groups only exist in the NSPP FFTs; for SPP FFTs, as the in-place strategy is available, only two memory groups are enough. The HRSB unit is the kernel processing engine. The commutators located between the memories and the HRSB unit provide an efficient data routing mechanism, which is controlled by the computation address generator. The twiddle factors and the FFT size parameters are all prestored in register files, which are used to configure the FFT working modes. The exponent scaling unit controls scaling operations for block floating-point, which can increase the signal-to-quantization-noise ratio and reduce the memory storage.

The 2^n-point FFT is operated in continuous-flow mode with the in-place strategy using only two parallel radix-2/2^2/2^3 MDC units. For instance, to execute a 128-point FFT, only one MDC unit is active in stage 0, and two radix-2^3 MDC units are activated in the remaining stages. The computation can be completed within 128 clock cycles. The input and output index vector generators in the 2^n-point FFT are merged into one; the only difference is that the binary representation of the index for the input is in a forward manner, while it is in a reversed manner for the output. The architectures of the index vector generator and the radix-2^3 MDC unit are detailed in [12].
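As a software sketch of the index machinery of Section III-C (Python, with illustrative function names; the hardware uses modulo and multiplication circuits instead), the following code derives the decomposition digits by the CFA map (17) and the inverse PFA map (19), forms the bank and address by (20), and reproduces the 12-point example, where input index 1 maps to (bank, address) = (1, 0) under the CFA but (0, 1) under the PFA:

```python
from math import prod

def pfa_digits(n, factors):
    # Inverse PFA input map (19): n_i = <n * a_i>_{N_i}, where
    # a_i = <M_i^-1>_{N_i} with M_i = N / N_i, so <a_i * M_i>_{N_i} = 1.
    N = prod(factors)
    return [(n * pow(N // Ni, -1, Ni)) % Ni for Ni in factors]

def cfa_digits(n, factors):
    # CFA input map (17): plain mixed-radix digits of n, obtained
    # by divisions stage by stage.
    out = []
    for Ni in reversed(factors):
        out.append(n % Ni)
        n //= Ni
    return out[::-1]

def bank_addr(digits, factors):
    # Generalized bank and address rule (20); the bank count is the
    # largest (last) factor N_m, and the address drops the last digit.
    m = len(factors)
    bank = sum(digits) % factors[-1]
    addr = sum(d * prod(factors[i + 1 : m - 1])
               for i, d in enumerate(digits[:-1]))
    return bank, addr

# 12 = 3 x 4: input index 1 lands at (1, 0) under the CFA digits but
# at (0, 1) under the PFA digits, matching the Fig. 3 example.
assert bank_addr(cfa_digits(1, [3, 4]), [3, 4]) == (1, 0)
assert bank_addr(pfa_digits(1, [3, 4]), [3, 4]) == (0, 1)

# With bank = <n1 + n2>_4, every 3-point butterfly (vary n1) and every
# 4-point butterfly (vary n2) reads its operands from distinct banks.
for n2 in range(4):
    assert len({bank_addr([n1, n2], [3, 4])[0] for n1 in range(3)}) == 3
for n1 in range(3):
    assert len({bank_addr([n1, n2], [3, 4])[0] for n2 in range(4)}) == 4
```

The final two loops are the conflict-freedom property of Fig. 3(b): varying any single digit sweeps the bank index through distinct residues modulo the largest factor.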

The length of the NSPP DFT can be generally expressed as


N = 2 p 3q 5r , p = 2, . . . , 8; q = 1, . . . , 5; q = 1, 2. (22)
Because the factors 2, 3, and 5 are mutually prime, [14] simply factorizes N into three different prime parts, namely, 2^p, 3^q, and 5^r, using the PFA, and applies the CFA within the different parts. However, this factorization method cannot completely satisfy the continuous-flow mode. For example, the 972-point DFT is decomposed into 4 × 3 × 9 × 9 in [14]. The nine-point DFTs are computed using a radix-3 delay element matrix within three cycles, and radix-4 and radix-3 DFTs can both be computed in one cycle. Therefore, the total computation cycle count is 243 + 324 + 324 + 324 = 1215, which is more than the computation length. Moreover, it suffers from a much more complicated data routing scheme.
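The cycle counts above can be reproduced numerically. The sketch below (plain Python, not part of the original design) evaluates [14]'s 4 × 3 × 9 × 9 decomposition against the 9 × 9 × 12 decomposition derived later in this section, using the per-butterfly cycle counts quoted in the text (radix-3/4 take 1 cycle; a 9-point DFT takes 3; a 12-point, 4):

```python
# Total cycles for a staged decomposition of an N-point DFT:
# each stage computes N/radix butterflies at cycles_per_butterfly each.

N = 972

def total_cycles(stages):
    """stages: list of (radix, cycles_per_butterfly) pairs."""
    return sum((N // radix) * cycles for radix, cycles in stages)

print(total_cycles([(4, 1), (3, 1), (9, 3), (9, 3)]))  # 1215 > 972
print(total_cycles([(9, 3), (9, 3), (12, 4)]))         # 972
```

Only the second decomposition finishes within N = 972 cycles, i.e., within the continuous-flow budget.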
In this paper, we present a new factorization method that makes the continuous-flow mode completely available and simultaneously ensures the effectiveness of the conflict-free address scheme. The DFT length N is first decomposed into three prime parts as in [14]. If the number of computation cycles is too large to support the continuous-flow mode, some trivial factors are combined together to make the decomposition more compact. The proposed new factorization method consists of the following steps.
i) Obtain each value of p, q, and r.
ii) If p > 4, 2^p is decomposed into 2^4 and 2^(p−4), except that 2^5 is decomposed into 2^3 and 2^2; otherwise, there is only one level for 2^p. For 3^q, if q = 5, there will be three levels, 3^2, 3^2, and 3; 3^2 and 3^(q−2) for q = 3, 4; and one level for q = 1, 2. There is only one level for 5^r.
iii) Calculate the computation time based on the above decomposition. If the decomposition levels for 2^p, 3^q, and 5^r are i, j, and k, the computation time of the complete decomposition is T = i · (N/4) + j · (N/3) + k · (N/5).
iv) If the computation time T is still larger than N, some trivial factors in each part will be combined. However, the 2^4, 3^2, and 5^2 factors should remain unchanged.
v) Calculate the new computation time. If it still does not satisfy the time restriction, return to step iv).
According to steps i)–iii), the 972-point FFT is decomposed into 4 × 3 × 9 × 9. However, the computation time, as mentioned above, does not satisfy the time requirement. We then combine the trivial factors 4 and 3 together to reduce the computation time further, and obtain the decomposition 9 × 9 × 12. The high-radix butterflies, such as radix-9 and radix-12, can be implemented by a multistage MDC unit that makes the entire computation time satisfy the time restriction. Although this method makes the PFA incompatible with some lengths, this situation only exists for lengths 864 and 972.

A. Index Vector Generator

An index vector generator is used to distribute the input data to the proper memory positions, and to reorder the output data to natural sequence. As shown in Fig. 5(a), the index vector generator can be designed hierarchically. The index counter counts in natural order, and the mutually prime factors n_i of the input index n are obtained using the inverse PFA mapping in (19). According to (19), the inverse PFA mapping methods of the input and output to obtain the prime factors are different. Then, some of the large prime factors should be further decomposed into trivial factors by the CFA. At last, all the trivial factors are used to obtain the memory bank and address according to (20). Take the 1296-point FFT for example. The index counter counts from 0 to 1295, and then the prime factors of each index, ranging over [0, ..., 80] and [0, ..., 15], are obtained through (19). The value of the prime factor 81 can further be divided into 9 × 9. The three stage factors, 9, 9, and 16, are all implemented by two-level trivial factors of 3 or 4. We use four memory banks for this FFT length, and the memory bank and address can be obtained through the trivial factors using the method in (20). Moreover, both the input and output inverse PFA mappings in (19) use N-modulo operations. We can use an (n+1)-level compare-select tree structure to obtain the final modulo results, as shown in Fig. 5(b). In addition, the configurable information, such as the bank numbers and the parameters of the PFA mapping for different FFT sizes, is prestored in register files.

Fig. 5. (a) Index vector generator. (b) (n+1)-level N-modulo operation. (c) Commutator architecture between the memory banks and PE.

In the computation mode, P concurrent data indices of the PE are obtained according to the computation address generator. Then, the bank indices and memory addresses of
1926 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 6, JUNE 2017

TABLE I
NUMBER OF COMPUTATION CYCLES OF DIFFERENT BUTTERFLIES

each data are computed through the proposed address scheme in (20). Since the indices differ in only one digit n_s of all the decomposition components, only one bank index needs to be computed, while the others can easily be derived from it. When data are read out from different memory banks, they switch to the right data paths of the PE through the input commutator, as shown in Fig. 5(c). The switching pattern of the commutator is controlled by the bank indices above. When the data have been calculated in the PE, all the data are sent back to the places from which they were read out. Thus, the output commutator of the PE has the reversed switching pattern of the input commutator.

B. Butterfly Engine Applying the HRSB Scheme

1) HRSB Architecture for Radix-8/9/12/15/16/25 Butterflies: The high-radix butterflies required by the decomposition algorithm, such as radix-8/9/12/15/16/25, are implemented by the HRSB scheme, i.e., a two-stage MDC unit. As shown in Fig. 6(a), the HRSB architecture contains two stages of unified radix-2/3/4/5 butterfly cores, some different-length delay elements, a switch block, trivial twiddle factors between the two butterflies, and some point-control multiplexers. The two small BUs are connected in a pipeline manner to construct a one-level computation stage. The time operation of the two-stage radix-25 BU is shown in Fig. 6(b). In the first BU stage, 5 concurrent data are sent into the first BU. After being computed, the data are subsequently sent to the delay element and switch unit to change the interval of the parallel data for the second BU. Finally, the computed data return to the in-place memory bank locations after reordering. Although the time flow graphs of the radix-8, 12, 15 BUs are not as regular as those of the radix-9, 16, 25 BUs, the procedures are similar. When the radix-8 butterfly is applied in the HRSB architecture, as radix-4 is used in the first stage, there must be two groups of radix-2 data in the second stage to keep up the processing speed.

Fig. 6. Proposed HRSB architecture. (a) Architecture of the two-stage MDC unit. (b) Time scheme of the radix-25 BU.

When the computations are completed in the first-stage butterfly, they can immediately be calculated by the second stage. Thus, using the HRSB scheme, the radix-8, 9, 12, 15, 16, 25 DFTs can be completed in 2, 3, 4, 5, 4, and 5 cycles, respectively, as shown in Table I.

Moreover, the HRSB architecture can also be configured to bypass the first BU to obtain only radix-2/3/4/5 butterflies. As a result, the HRSB architecture can combine all the required computation butterflies into one PE. For the 972-point FFT, as the decomposition manner is 9 × 9 × 12 and the butterfly engine is implemented by the HRSB architecture, the entire computation time is 2 × 3 × (972/9) + 4 × (972/12) = 972. Thus, the computation cycles of FFTs that apply the HRSB architecture can be acquired by T = Σ_{i=1}^{m} T_i · (N/N_i), where m is the number of computation stages, N_i is one stage's butterfly radix, and T_i is the butterfly's computation cycles. Although the PFA is applied in the DFT, twiddle factors remain between the common-factor components. The twiddle factors are stored in register files to be obtained by the computation indices of the PE. By utilizing the π/8 symmetry property of trigonometric functions, a total of 330 twiddle factors are prestored for the DFT unit.

2) Unified WFTA Butterfly Core for Radix-2/3/4/5 Butterflies: In the butterfly engine, the small radix-2/3/4/5 butterflies are implemented by a unified WFTA core. The WFTA for prime-factor DFTs can be written as

[X(0), ..., X(N − 1)]^T = O × M × I × [x(0), ..., x(N − 1)]^T   (23)

where I represents the preaddition between the inputs, M is a diagonal matrix of coefficient multiplications, and O is the postaddition of the results after multiplication [16]. In [18], a reconfigurable 2-, 3-, 4-, 5-, 7-point WFTA butterfly core for memory-based FFTs is proposed. The complexity of the butterfly core is equal to that of the 7-point DFT in terms of adders and multipliers, plus some multiplexers. To be adopted in our implementation, the butterfly core should complete two radix-2 butterflies or one radix-3/4/5 butterfly within one cycle in reconfigurable mode.

As in [18], we can set the input of the butterfly core to zero and the multiplication coefficients to zero or one to reduce the number of multiplexers. The unified architecture
is shown in Fig. 7, and is almost the same as the five-point WFTA architecture with only three additional single-gate multiplexers and one 3:1 multiplexer. The unified WFTA core has a much simpler architecture than the multi-radix structure in [14], which uses more multiplexers. The input and output relations, the required coefficients, and the control signals of the multiplexers are all shown in Table II, where a dash means an irrelevant condition. The middle coefficient multiplications can be implemented by constant multipliers that are constructed of shifters and adders.

Fig. 7. Proposed unified BU for two-, three-, four-, and five-point DFTs.

TABLE II
CONFIGURATION INFORMATION OF THE UNIFIED WFTA CORE

TABLE III
COMPARISON OF DIFFERENT SINGLE-POWER POINT FFTS

V. HARDWARE IMPLEMENTATION RESULTS AND PERFORMANCE COMPARISON

The proposed address scheme can support arbitrary-length FFT computations, including SPP FFTs and NSPP FFTs. For SPP FFTs, the proposed address scheme adjusts the decomposition algorithm and the parallelism of the PE to modify the throughput. We compare several SPP FFTs that have different conflict-free address schemes with our 2^n-point FFT in Table III. The applied algorithm and data parallelism decide the amount of resources of the architecture and the processing time directly. Radix-2 designs [19], [20] lead to simple conflict-free control mechanisms and fewer arithmetic units; however, the processing time is too long for high-throughput applications. For high-radix designs [21], [22], the processing time is shorter, while the arithmetic units are more consumptive. Compared with the other FFTs, the radix-2^k algorithm in our design is a tradeoff between throughput and hardware resources. The throughput and the hardware units can be adjusted concurrently by the applied algorithm and data parallelism, in both high- and low-throughput situations.

For NSPP FFTs, we compare the hardware resources of several DFTs with different address schemes in Table IV. Compared with the schemes in [13], [14], [23], and [24], our method can fully satisfy the continuous-flow mode for all the 35 prime-factor lengths in the LTE system. Even the most aggressive competitor [14] is too slow to satisfy a DFT of size 972. The proposed method requires fewer computation cycles than the others. Hence, our FFT owns the highest processing speed. Though our method consumes more memory than [14], it can unify the ultimate results, which reduces the difficulty of evaluating the correctness of the FFTs. There are 13 complex adders, 5 constant multipliers for coefficient multiplications, and 5 complex multipliers for twiddle factor multiplications in the unified butterfly core. Thus, the resources in the PE are doubled. The PE in [14] occupies almost the same hardware as ours. However, it uses more control multiplexers and buffers, and the controlling mechanism is more complicated. In conclusion, our DFT is the most hardware-efficient one.

Norm. Area = Area / (L_min/55 nm)^2
Norm. Energy = (Power × T_clock × N_execute) / (N_FFT × (Volt./1.08 V)^2 × (L_min/55 nm)).   (24)

We implemented the two FFT units in the LTE system using the SMIC 55-nm technology. A summary of the results and a performance comparison with [14] are given in Table V. The
normalized area and normalized energy are defined in (24), where N_execute is the number of clock cycles required to perform the FFT, N_FFT is the FFT size, T_clock is the clock period, and L_min is the feature size of the used technology [2]. The two FFT units are both evaluated at the maximum size, i.e., 1296 and 2048 points. This paper uses the same design parameters as [14]. However, the normalized area and power are both smaller. Our method has a simpler address control scheme than [14]. Reference [14] applies seven memory banks for the DFT unit and five banks for the FFT unit, whereas our method only requires five and four banks, respectively. According to [13], more memory banks lead to greater area overhead and power dissipation. Moreover, the HRSB architecture and the unified WFTA core in the DFT unit are more hardware-efficient than the delay element matrix and the multi-radix structure in [14]. The 128–2048-point FFT is an SPP FFT unit, while the 12–1296-point FFT is an NSPP one. As stated in Section IV, memory consumption is 2N in SPP FFTs, and 3N in NSPP FFTs. Actually, [14] uses only N memory in the design of the two FFT units, which cannot fully support the continuous-flow mode. Although our design uses more memory, it achieves better area and power performance, which further verifies that our method provides greater hardware efficiency. The FFT and DFT units occupy 0.615- and 1.063-mm² core area, and consume 19.2- and 26.3-mW power, respectively, at a 122.88-MHz clock frequency.

TABLE IV
PERFORMANCE COMPARISON OF THE DFTS WITH DIFFERENT ADDRESS SCHEMES

TABLE V
IMPLEMENTATION PERFORMANCE AND COMPARISON WITH [14]

VI. CONCLUSION

We present memory-based FFT implementations with generalized efficient conflict-free address schemes. Address schemes for different FFT lengths are integrated in this paper to support FFT processing for various systems. The memory bank and address can be generated by modulo and multiplication operations on the decomposition digits. For both SPP and NSPP FFTs, the high-radix algorithm and parallel-processing technique can be used to increase the throughput. The address scheme for FFTs applied with the PFA is also explored. Moreover, a decomposition method, named HRSB, is designed to suit the high-radix algorithm. Full hardware architectures for the FFTs in LTE systems are illustrated, including the index vector generator, the butterfly engine, and the unified WFTA core. The implementation results and comparisons are also presented.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their careful review and precious comments.

REFERENCES

[1] C.-L. Yu, K. Irick, C. Chakrabarti, and V. Narayanan, "Multidimensional DFT IP generator for FPGA platforms," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 4, pp. 755–764, Apr. 2011.
[2] C.-H. Yang, T.-H. Yu, and D. Markovic, "Power and area minimization of reconfigurable FFT processors: A 3GPP-LTE example," IEEE J. Solid-State Circuits, vol. 47, no. 3, pp. 757–768, Mar. 2012.
[3] D. Cohen, "Simplified control of FFT hardware," IEEE Trans. Acoust., Speech, Signal Process., vol. 24, no. 6, pp. 577–579, Dec. 1976.
[4] M. C. Pease, "Organization of large scale Fourier processors," J. ACM, vol. 16, no. 3, pp. 474–482, Jul. 1969.
[5] L. G. Johnson, "Conflict free memory addressing for dedicated FFT hardware," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 39, no. 5, pp. 312–316, May 1992.
[6] B. G. Jo and M. H. Sunwoo, "New continuous-flow mixed-radix (CFMR) FFT processor using novel in-place strategy," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 5, pp. 911–919, May 2005.
[7] J. Baek and K. Choi, "New address generation scheme for memory-based FFT processor using multiple radix-2 butterflies," in Proc. Int. SoC Design Conf., vol. 1, Nov. 2008, pp. I-273–I-276.
[8] D. Reisis and N. Vlassopoulos, "Conflict-free parallel memory accessing techniques for FFT architectures," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 11, pp. 3438–3447, Dec. 2008.
[9] K. H. Chen and Y. S. Li, "A multi-radix FFT processor using pipeline in memory-based architecture (PIMA) for DVB-T/H systems," in Proc. Int. Conf. Mixed Design Integr. Circuits Syst., Jun. 2008, pp. 549–553.
[10] S.-Y. Lee, C.-C. Chen, C.-C. Lee, and C.-J. Cheng, "A low-power VLSI architecture for a shared-memory FFT processor with a mixed-radix algorithm and a simple memory control scheme," in Proc. IEEE Int. Symp. Circuits Syst., May 2006, pp. 157–160.
[11] P.-Y. Tsai, T.-H. Lee, and T.-D. Chiueh, "Power-efficient continuous-flow memory-based FFT processor for WiMax OFDM mode," in Proc. Int. Symp. Intell. Signal Process. Commun., Dec. 2006, pp. 622–625.
[12] P.-Y. Tsai and C.-Y. Lin, "A generalized conflict-free memory addressing scheme for continuous-flow parallel-processing FFT processors with rescheduling," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 12, pp. 2290–2302, Dec. 2011.
[13] C.-F. Hsiao, Y. Chen, and C.-Y. Lee, "A generalized mixed-radix algorithm for memory-based FFT processors," IEEE Trans. Circuits Syst. II, Express Briefs, vol. 57, no. 1, pp. 26–30, Jan. 2010.
[14] J. Chen, J. Hu, S. Lee, and G. E. Sobelman, "Hardware efficient mixed radix-25/16/9 FFT for LTE systems," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 2, pp. 221–229, Feb. 2015.
[15] C. Burrus, "Index mappings for multidimensional formulation of the DFT and convolution," IEEE Trans. Acoust., Speech, Signal Process., vol. 25, no. 3, pp. 239–242, Jun. 1977.
[16] D. Kolba and T. Parks, "A prime factor FFT algorithm using high-speed convolution," IEEE Trans. Acoust., Speech, Signal Process., vol. 25, no. 4, pp. 281–294, Apr. 1977.
[17] C. Temperton, "A generalized prime factor FFT algorithm for any N = 2^p 3^q 5^r," SIAM J. Sci. Statist. Comput., vol. 13, no. 3, pp. 676–686, Mar. 1992.
[18] F. Qureshi, M. Garrido, and O. Gustafsson, "Unified architecture for 2, 3, 4, 5, and 7-point DFTs based on Winograd Fourier transform algorithm," Electron. Lett., vol. 49, no. 5, pp. 348–349, May 2013.
[19] X. Xiao, E. Oruklu, and J. Saniie, "An efficient FFT engine with reduced addressing logic," IEEE Trans. Circuits Syst. II, Express Briefs, vol. 55, no. 11, pp. 1149–1153, Nov. 2008.
[20] H.-F. Luo, Y.-J. Liu, and M.-D. Shieh, "Efficient memory-addressing algorithms for FFT processor design," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 10, pp. 2162–2172, Oct. 2015.
[21] S.-J. Huang and S.-G. Chen, "A high-throughput radix-16 FFT processor with parallel and normal input/output ordering for IEEE 802.15.3c systems," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 8, pp. 1752–1765, Aug. 2012.
[22] M. Garrido, M. Á. Sánchez, M. L. López-Vallejo, and J. Grajal, "A 4096-point radix-4 memory-based FFT using DSP slices," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 1, pp. 375–379, Jan. 2017.
[23] Altera. (2013). DFT/IDFT Reference Design. [Online]. Available: http://www.altera.com
[24] Xilinx. (2014). LogiCORE IP Discrete Fourier Transform V4.0. [Online]. Available: http://www.xilinx.com

Kai-Feng Xia received the B.S. degree from Wuhan University, Wuhan, China, in 2010. He is currently pursuing the Ph.D. degree in electronic engineering with the Institute of Microelectronics of Chinese Academy of Sciences, Beijing, China. His current research interests include VLSI architectures for communication and signal processing systems.

Bin Wu received the B.S. degree from Xi Hua University, Chengdu, China, in 1999, the M.S. degree from Chongqing University, Chongqing, China, in 2002, and the Ph.D. degree from the Institute of Microelectronics of Chinese Academy of Sciences, Beijing, China, in 2011. He is currently a Professor with the Institute of Microelectronics of Chinese Academy of Sciences. His current research interests include high-performance/low-power VLSI designs for broadband wire-line and wireless communication systems.

Tao Xiong received the B.S. degree from the Huazhong University of Science and Technology, Wuhan, China, in 2013. He is currently pursuing the M.S. degree with the Institute of Microelectronics of Chinese Academy of Sciences, Beijing, China. His current research interests include low-power VLSI designs.

Tian-Chun Ye received the B.S. degree from Fudan University, Shanghai, China, in 1986. He has been a Professor with the Institute of Microelectronics of Chinese Academy of Sciences (IMECAS), Beijing, China, since 1997. He is currently the Director of IMECAS and the China Internet of Things Development Center, Wuxi, China. His current research interests include semiconductor device and IC fabrication processes and equipment technologies.