A Memory-Based FFT Processor Design With Generalized Efficient Conflict-Free Address Schemes
A Memory-Based FFT Processor Design With Generalized Efficient Conflict-Free Address Schemes
A Memory-Based FFT Processor Design With Generalized Efficient Conflict-Free Address Schemes
Abstract— This paper presents the design and implementation banks to hold intermediate results, processing the data recur-
of memory-based fast Fourier transform (FFT) processors with sively regardless of computation length. Memory-based FFTs
generalized efficient, conflict-free address schemes. We unified achieve better hardware efficiency, compared with pipeline
the conflict-free address schemes of three different FFT lengths,
including the single-power points, the common nonsingle-power ones. Nevertheless, the throughput of memory-based FFTs is
points, and the nonsingle-power points applied with a prime usually restricted by the butterfly radix and concurrent data
factor algorithm. Though the three cases differ in terms of decom- access contentions.
position, they are all compatible with memory-based architecture Conflict-free address schemes for concurrent data access
by the way of the proposed address schemes. Moreover, the from different memory banks become an essential problem.
decomposition algorithm utilizes a method, named high-radix–
small-butterfly (HRSB), to decrease the computation cycles and A parity bit check method for one or more radix-2 PEs is first
eliminate the complexity of the processing engine. In addition, an introduced in [3] and [4]. An in-place strategy from [5] reduces
efficient index generator, a simplified multipath delay commuta- the total memory storage to minimum N. In [6], a mixed
tor engine, and a unified Winograd Fourier transform algorithm radix-4/2 in-place scheme makes the input and output bits
butterfly core were also designed. We designed two FFT examples symmetric, and then, the conflict-free scheme is extended to a
in long-term evolution system to verify the availability of the
address scheme, including a 2n (128–2048)-point FFT unit and a mixed-radix algorithm. In [7], a multiple radix-2 PE scheme
35 different point (12–1296) DFT unit. Compared with previous is demonstrated to increase the throughput of FFT processors.
works with similar address schemes, this paper supports more However, the methods in [6] and [7] are only suitable for
generalized lengths and achieves more flexible throughput. power-of-two point FFTs. In [8], a single- and multiple-PE
Index Terms— Conflict-free address scheme, long-term method for arbitrary radix-b algorithm is discussed. However,
evolution (LTE), memory-based fast Fourier transform (FFT) it requires memory in every stage.
processor, prime factor algorithm (PFA), Winograd algorithm. In [9]–[11], a multipath delay commutator (MDC) archi-
tecture with high radix is used to replace the complex PE
I. I NTRODUCTION in conventional memory-based FFTs. This method provides
low-power dissipation, high data rates, good computational
T HE fast Fourier transform (FFT) is a commonly used
algorithm in digital signal processing areas, such as imag-
ing applications and communication systems. Image process-
efficiency, and refined length flexibility. However, none of
them provides a detailed conflict-free address scheme. In [12],
ing requires computation sizes as high as 222 [1], whereas a generalized conflict-free address scheme with single or
communication systems apply several computation lengths multiple radix-2q MDC architectures for 2n -point FFTs is
simultaneously, such as the sizes of 128–2048 in 3GPP long- illustrated. However, this scheme does not extend the principle
term evolution (LTE) systems [2]. Consequently, long and to arbitrary-length FFTs and more general decomposition
variable-size FFT applications are becoming popular. algorithms. Generalized mixed-radix algorithms that support
FFT processors can be categorized into two architectures: both traditional 2n -point and prime-sized FFTs are proposed
pipeline- and memory-based architectures. Memory-based in [13] and [14]. Although [13] and [14] obtain the memory
FFTs pass the data multiple times through a single butterfly bank and address rules according to the decomposition algo-
processing element (PE) or set of PEs, with several memory rithms, they do not state the data distribution procedure clearly,
especially when prime factor algorithm (PFA) is applied within
Manuscript received April 22, 2016; revised October 23, 2016, the FFT. Moreover, they cannot fully support the continuous-
December 7, 2016, and January 21, 2017; accepted January 31, 2017. Date flow working mode.
of publication March 3, 2017; date of current version May 22, 2017. This
work was supported by the National Major Science and Technology Program This paper presents a memory-based FFT processor design
of China under Grant 2014ZX03001011-002. This paper was presented at methodology with a generalized conflict-free address scheme
the IEEE International Symposium on Circuits and Systems Proceedings, for arbitrary-length FFTs. We unify the conflict-free address
Montreal, Canada, 2016.
The authors are with the Institute of Microelectronics, Chinese schemes of three different FFT lengths, including the single-
Academy of Sciences, Beijing 100029, China (e-mail: xiakaifeng@ime.ac.cn; power point (SPP) FFTs, the common nonsingle-power
wubin@ime.ac.cn). point (NSPP) FFTs, and the NSPP FFTs applied with the
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org. PFA, to the same address generation format. The memory
Digital Object Identifier 10.1109/TVLSI.2017.2666820 bank index and the internal address are all generated by
1063-8210 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
1920 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 6, JUNE 2017
C. Conflict-Free Address Method Proposed by Tsai and Lin III. C ONTINUOUS -F LOW C ONFLICT-F REE A DDRESS
S CHEME FOR D IFFERENT C OMPUTATION L ENGTHS
A generalized conflict-free memory address scheme for
memory-based FFTs with parallel arithmetic processing units In this section, we will propose the continuous-flow conflict-
is presented in [12]. Assume that the N = 2m FFT free address schemes for different FFT computation lengths,
employs a mixed-radix algorithm with the maximum radix-2q including SPP FFTs, common NSPP FFTs, and NSPP FFTs
MDC units. The FFT is computed in total S stages, where applied with PFA. Although these three cases apply different
S = m/q and · is the ceiling function. At stage s, decomposition algorithms, their conflict-free address methods
the adopted radix is 2qs , with qs ≤ q, and the number of can be unified by a single scheme. The address schemes for
radix-2qs BUs at current stage is 2m−qs . bsI , bsO denote the the three cases build progressively from one to the next.
different bits of the upper and lower binary indices for the
input and output data of the radix-2qs MDC at stage s. A. Address Scheme for Single-Power Point FFTs
Consequently, UsM = {bsI , bsO } represents a set containing The conflict-free address scheme for SPP FFTs pre-
joint constraints in stage s for nonconflict data access. Since sented by Tsai and Lin [12] is illustrated only for
in-place strategy is applied, address representation of the even 2n -point FFTs. We can extend the method to arbitrary-length
symbol for the same memory is in the reversed binary format single-power r n -point FFTs by replacing the XOR operations
of the odd symbol. Thus, bit constraints for forward and in (8) and (9) with r -modulo additions. Assume that the
reversed address representations, UsF and UsR , in one stage N = r m -point FFT employs a mixed radix with the maximum
should be considered concurrently. EXCLUSIVE OR (XOR) radix-r q MDC units. The parallelism of the MDC PE is
operations of all the decimation constraints from stage 0 to P = r p , and P must divide r m−q , the number of butterflies at
S − 1 can be used to obtain the memory bank index. each stage. As a result, formation (1) can be transformed into
For simplicity, we use 128-point FFT to illustrate. If we
apply radix-23 algorithm, the 128-point FFT can be factorized N m
· logr q r m
≤ N → ≤ r p+1 . (10)
into three stages: 2 × 23 × 23 . The forward and reversed r ·r p q
constraints in each stage can be given by However, as the throughput of SPP FFTs is associated with
the PE number closely, it is critical to explore the relation-
U0F = {b[6]}, U1F = {b[5], b[3]} ship between the number of constraint sets and the applied
U2F = {b[2], b[0]}, U0R = {b[0]} algorithm under different cases.
1) Only Increasing the PE Parallelism or the Butterfly
U1R = {b[1], b[3]}, U2R = {b[4], b[6]}. (7)
Radix: If we only increase the parallelism of the PE P, while
keeping the butterfly radix as the simplest radix-r , then q = 1,
There are three disjoint sets of constraints, i.e., G 0 = U0F ∪ and (10) will become m ≤ r p+1 . For any stage s, the r parallel
U2R , G 1 = U0R ∪ U2F , and G 2 = U1F ∪ U1R . If there is only data indices are only different in the (m −s −1)th digit. For the
one MDC PE, the bank mapping function B is the combined m-level forward address representation, there are m different
XOR operation of the three constraints constraint sets and vice versa
TABLE I
N UMBER OF C OMPUTATION C YCLES OF D IFFERENT B UTTERFLIES
TABLE II
C ONFIGURATION I NFORMATION OF THE U NIFIED WFTA C ORE
TABLE III
C OMPARISON OF D IFFERENT S INGLE -P OWER P OINT FFT S
TABLE IV
P ERFORMANCE C OMPARISON OF THE DFT S W ITH D IFFERENT A DDRESS S CHEMES
[10] S.-Y. Lee, C.-C. Chen, C.-C. Lee, and C.-J. Cheng, “A low-power VLSI Kai-Feng Xia received the B.S. degree from Wuhan
architecture for a shared-memory FFT processor with a mixed-radix University, Wuhan, China, in 2010. He is currently
algorithm and a simple memory control scheme,” in Proc. IEEE Int. pursuing the Ph.D. degree in electronic engineering
Symp. Circuits Syst., May 2006, pp. 157–160. with the Institute of Microelectronics of Chinese
[11] P.-Y. Tsai, T.-H. Lee, and T.-D. Chiueh, “Power-efficient continuous- Academy of Sciences, Beijing, China.
flow memory-based FFT processor for WiMax OFDM mode,” in Proc. His current research interests include VLSI archi-
Int. Symp. Intell. Signal Process. Commun., Dec. 2006, pp. 622–625. tectures for communication and signal processing
[12] P.-Y. Tsai and C.-Y. Lin, “A generalized conflict-free memory addressing systems.
scheme for continuous-flow parallel-processing FFT processors with
rescheduling,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19,
no. 12, pp. 2290–2302, Dec. 2011.
[13] C.-F. Hsiao, Y. Chen, and C.-Y. Lee, “A generalized mixed-radix
Bin Wu received the B.S. degree from Xi Hua Uni-
algorithm for memory-based FFT processors,” IEEE Trans. Circuits
versity, Chengdu, China, in 1999, the M.S. degree
Syst. II, Express Briefs, vol. 57, no. 1, pp. 26–30, Jan. 2010.
from Chongqing University, Chongqing, China, in
[14] J. Chen, J. Hu, S. Lee, and G. E. Sobelman, “Hardware efficient mixed
2002, and the Ph.D. degree from the Institute of
radix-25/16/9 FFT for LTE systems,” IEEE Trans. Very Large Scale
Microelectronics of Chinese Academy of Sciences,
Integr. (VLSI) Syst., vol. 23, no. 2, pp. 221–229, Feb. 2015.
Beijing, China, in 2011.
[15] C. Burrus, “Index mappings for multidimensional formulation of the
He is currently a Professor with the Insti-
DFT and convolution,” IEEE Trans. Acoust., Speech, Signal Process.,
tute of Microelectronics of Chinese Academy of
vol. 25, no. 3, pp. 239–242, Jun. 1977.
Sciences. His current research interests include high-
[16] D. Kolba and T. Parks, “A prime factor FFT algorithm using high-speed
performance/lower power VLSI designs for broad-
convolution,” IEEE Trans. Acoust., Speech, Signal Process., vol. 25,
band wire-line and wireless communication systems.
no. 4, pp. 281–294, Apr. 1977.
[17] C. Temperton, “A generalized prime factor FFT algorithm for any
N = 2 p 3q 5r,” SIAM J. Sci. Statist. Comput., vol. 13, no. 3, pp. 676–686, Tao Xiong received the B.S. degree from the
Mar. 1992. Huazhong University of Science and Technology,
[18] F. Qureshi, M. Garrido, and O. Gustafsson, “Unified architecture for 2, 3, Wuhan, China, in 2013. He is currently pursuing the
4, 5, and 7-point DFTs based on Winograd Fourier transform algorithm,” M.S. degree with the Institute of Microelectronics of
Electron. Lett., vol. 49, no. 5, pp. 348–349, May 2013. Chinese Academy of Sciences, Beijing, China.
[19] X. Xiao, E. Oruklu, and J. Saniie, “An efficient FFT engine with His current research interests include low-power
reduced addressing logic,” IEEE Trans. Circuits Syst. II, vol. 55, no. 11, VLSI designs.
pp. 1149-–1153, Nov. 2008.
[20] H.-F. Luo, Y.-J. Liu, and M.-D. Shieh, “Efficient memory-addressing
algorithms for FFT processor design,” IEEE Trans. Very Large Scale
Integr. (VLSI) Syst., vol. 23, no. 10, pp. 2162–2172, Oct. 2015.
[21] S.-J. Huang and S.-G. Chen, “A high-throughput radix-16 FFT processor
with parallel and normal input/output ordering for IEEE 802.15.3c Tian-Chun Ye received the B.S. degree from Fudan
systems,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 8, University, Shanghai, China, in 1986.
pp. 1752–1765, Aug. 2012. He has been a Professor with the Institute of
[22] M. Garrido, M. Á. Sánchez, M. L. López-Vallejo, and J. Grajal, Microelectronics of Chinese Academy of Sciences
“A 4096-point radix-4 memory-based FFT using DSP slices,” IEEE (IMECAS), Beijing, China, since 1997. He is cur-
Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 1, pp. 375–379, rently the Director of IMECAS and the China Inter-
Jan. 2017. [Online]. Available: http://ieeexplore.ieee.org net of Things Development Center, Wuxi, China.
[23] Altera. (2013). DFT/IDFT Reference Design. [Online]. Available: His current research interests include semiconductor
http://www.altera.com device and IC fabrication processes and equipment
[24] Xilinx. (2014). LogiCORE IP Discrete Fourier Transform V4.0. [Online]. technologies.
Available: http://www.xilinx.com