FFT Imp Butterfly

Chapter 4
Classification of FFT Processor

Architectures
There are numerous architectures for realizing FFT computations in real time
applications. In this chapter, we will only discuss two main classes: pipeline
architecture and shared memory-based architecture. We also compare different
architectures with respect to hardware requirement cost and throughput.
4.1 Pipeline Architecture
Pipeline architectures utilize parallel processing among the stages. The general
structures of these architectures [12] are as shown in Figure 4.1. These structures
consist of one butterfly element between commutators at each stage. The function of
butterfly element is to compute the data addition and subtraction. The commutator is
like a switch to rearrange the data from the butterfly element in order to perform
subsequent calculations more conveniently.
Pipeline architecture is widely used to attain high-speed operation. The problem
with this architecture is the relatively large die area required by the logNr butterfly
elements. Pipeline architecture can be roughly classified into three types:
 single path delay feedback (SDF) architecture,
 multiple path delay commutator (MDC) architecture and
 single path delay commutator (SDC) architecture.
These types have different properties, each with advantages and disadvantages. We
will describe these properties in the following subsections.
-25-
commutato
commutato
commutato
radix-r radix-r
butterfly radix-r butterfly butterfly
processing processing processing
element element (BF_PE) element
(BF_PE) (BF_PE)
r
Figure 4.1 Pipeline architecture
4.1.1 Single Path Delay Feedback (SDF) Architecture
In the general single path delay feedback architecture, the input data sequences
pass through one single path. The butterfly processing element performs the
computations on the data. This structure uses shift registers or RAM to store the
output data from the butterfly elements. As a result of the single output path, only one
complex multiplier is required.
Here, we give two examples to illustrate the principle of single path delay
feedback architecture. One example is a 16-point R2SDF and the other is a 16-point
R4SDF.
(1) Radix-2 single delay feedback (R2SDF)

Figure 4.2 is a 16-point radix-2 single path delay feedback architecture[24]. The
procedural flow of this architecture is as follows.
(a) In the beginning, the input data with indices 0 to 7 are stored in the shift
register (D8 in Figure 4.2). The radix-2 butterfly elements operate on
these data and on the remaining input data with indices 8 to 15.
(b) The data resulting from the butterfly addition operations are passed to
the second stage and the subtraction results are fed back to the shift
register (D8).
-26-
(c) After all the 8-point data addition results are output, the subtraction data
from the registers are passed to the second stage with a multiply twiddle
factor coefficient. The symbol A in Figure 4.2 and Figure 4.3 represents
the last output data of the first stage.
(d) The procedure flow of the latter three stages for the 16-point R2SDF are
similar to the first stage. The routing of data for R2SDF is given in
Figure 4.3.
D8 D4 D2 D1
radix-2 butterfy A radix-2 butterfy B radix-2 butterfy C radix-2 butterfy

input output
data data
twiddle twiddle twiddle
factor factor factor
coefficien coefficien coefficien
t t t
Figure 4.2 A 16- point radix-2 single path delay feedback architecture
input
data 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
output A data 0(1) 1(1) 2(1) 3(1) 4(1) 5(1) 6(1) 7(1) 8(1) 9(1) 10(11 1 12(1)13(1)14(1)15(1)
) (1)
output B data 0(2) 1(2) 2(2) 3(2) 4(2) 5(2) 6(2) 7(2) 8(2) 9(2) 10( 21) 1(21) 2(21) 3(21) 4(21) 5 (2)
output C data 0(3) 1(3) 2(3) 3(3) 4(3) 5(3) 6(3) 7(3) 8(3) 9(3) 10( 31) 1(31 2(3)13(31) 4(3)15(3)
final output data 0(4) 1(4) 2(4) 3(4) 4(4) 5(4) 6(4) 7(4) 8(4) 9(4) 10(41) 1(41) 2( 41) 3(41) 4( 41) 5 (4)
Figure 4.3 Routing of data for R2SDF (N=16)
(2) Radix-4 single delay feedback (R4SDF)

Radix-4 single delay feedback (R4SDF) is a variant of R2SDF. In R4SDF, the
butterfly operation is based on the radix-4 algorithm. The procedural flow of this
architecture [25] is identical to that of R2SDF. Figure 4.4 is a block diagram of the
radix-4 single delay feedback architecture. The schematic timing diagram is given in
-27-
Figure 4.5. The data output A in Figure 4.5 represents the data resulting from the 16-
point radix-4 butterfly computation of the first stage. The output data in Figure 4.5
represents the data resulting from butterfly computation in the second stage.
shift register shift register

input A
D D D D D
data
commutato
commutato
commutato
commutato
output
twiddle factor coefficient data
D D D D D
D D D D D
r
r
where D= delay
radix4
radix
BF
BF
4
Figure 4.4 A 16- point radix-4 single path delay feedback architecture
input
data 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
data of output A 0(1) 1(1) 2(1) 3(1) 4(1) 5(1) 6(1) 7(1) 8(1) 9(1) 10(11
) (1)
1 12(1)13(1)14(1) 15(1)
output
data 0(2) 1(2) 2(2) 3(2) 4(2) 5(2) 6(2) 7(2) 8(2) 9(2) 10(21) 1(21) 2(21) 3(21) 4(21) 5(2)
Figure 4.5 Timing diagram for R4SDF (N=16)
4.1.2 Multiple Path Delay Commutator (MDC) Architecture
In the general multiple path delay feedback architecture [14], the input data
sequences are organized into a parallel data path entering the commutator by suitable
delays and data summation is performed by the butterfly element at each stage. The
intermediate data is not fed back to register. It has a higher throughput data rate than
the single path delay feedback architecture.
-28-
Here, we give an example to illustrate the principles of the multiple path delay
commutator architecture. The architecture of a 16-point Radix-4 multiple path delay
commutator architecture (R4MDC) is as shown in Figure 4.6. The data routing
diagram is depicted in Figure 4.7. The workflow for this architecture is
(a) In the beginning, the input data from index 0 to 15 is decomposed into
four parallel data streams of four points each after the first commutator.
(b) These data are passed through different delay registers.
(c) After the delay, the resulting data are operated on by radix-4 butterfly
processing elements (BF4) and then multiplied by a twiddle factor.
(d) The result data are stored in the delay register.
(e) The function of the second commutator is to reorder the desired adjacent
data.
(f) Before the second butterfly operation, the ordered data are stored in an
appropriate delay register. The first parallel data stream must be shifted
by 3 delays. The second parallel data stream must be shifted by 2 delays.
The third parallel data stream must be shifted by 1 delay and the last
data stream is directly passed to the butterfly input in the second stage.
input
data D D
12 3
twiddle factor
D coefficient D output
commuta 8 D commut 2 data
tor 1 BF 1 BF4
4 a
twiddle factor tor 2
D coefficient D
4 D 1
2
twiddle factor
coefficient
D
3
Figure 4.6 A 16-point radix-4 multiple path delay commutator architecture
-29-
input
data 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
commutator
1
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
outpu delay 12
t delay 8
dela
y delay 4
0 1 2 3
radix 4 4 5 6 7
butterfl
y data 8 9 10 11
12 13 14 15
delay 1
input
dela delay 2
y delay 3
0 1 2 3
4 5 6 7
8 9 13
10 11
12 14 15
commutator
2
0 4 8 12
5 9
1 13
2 6 10 14
3 7 11 15
outpu delay 3
t delay 2
dela
y delay 1
0 4 8 12
1 5 9 13
radix 4
butterfl 2 6 10 14
y data
3 7 11 15
Figure 4.7 Timing diagram for R4MDC (N=16)
-30-
4.1.3 Single Path Delay Commutator (SDC) Architecture
The single path delay commutator (SDC) architecture [15][16] is based on the
modified multiple path delay commutator. In each word cycle, each stage produces a
single output rather than 4 outputs as in the multiple path delay commutator. Each
stage requires one complex multiplier, a delay commutator to correct the order of the
data, and a butterfly element. Here, we give an example to illustrate the principle of
single path delay commutator architecture. The architecture of a 16-point radix-4
single path delay commutator (R4SDC) is as shown in Figure 4.8.
output
twiddle data
input commu butterfl factor commu butterfl
data tator 1 y tator 2 y
elemen elemen
t t
c1 c2 c3 c1c2 c3
Figure 4.8 A 16-point radix-4 single path delay commutator architecture
Figure 4.9 is the commutator for the R4SDC at each stage. There are six shift
registers. The symbol Nt represents the number of delays for each shift register. The
multiplexer control parameters c1,c2, and c3 select the required data. The parameters
m1 and m2 are switch control signals to reorder data. Its function is graphically charted
in Figure 4.10 and Table 4.1. The timing diagram for the 16-point R4SDC is described
in Figure 4.11.
-31-
A B C D
Nt Nt Nt
1
input data E x[n+ 3N /4]
0
Nt
x[n+ 2N /4]
c1
1 sw itch
F
0
Nt
x[n+ N /4]
c2
1
G
0
Nt x[n]
c3
m1 m 2
Figure 4.9 Delay commutator for R4SDC at each stage
data0 data12 data4 data12 data8 data12 data12 data12

data4 data0 2 data0 data1 data0 data0 data0
data8 2
(a) (b) (c) (d)
Figure 4.10 Action of switch in Figure 4.9 for 16-point FFT
Table 4.1 Control signals at each stage in Figure 4.9
c1 c2 c3 m1 m2 Figure
1 1 1 0 0 Figure 4.10 (a)
0 1 1 0 1 Figure 4.10 (b)
0 0 1 1 0 Figure 4.10 (c)
0 0 0 1 1 Figure 4.10 (d)
-32-
input
data
A
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
B
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
C
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1
commutator
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
E
F 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
G
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0(1) 1(1) 2(1) 3(1) 4(1) 5(1)6(1) 7Radix

(1) (1) (1) (1) (1) (1) (1) (1) (1)
8 4 9 10 11 12 13 14 15
butterfly
data of output A 5(1)
B
0(1) 1(1) 2(1) 3(1) 4(1) 5(1)6(1) 7(1) 8(1) 9(1)10(11) 1(1)12(1)13(1)14(1)1
C 0(1) 1(1) 2(1) 3(1) 4(1) 5(1) 6(1) 7(1) 8(1) 9(1)10(11) 1(1)12(1)13(1)14(1)15(1)
D
0(1) 1(1) 2(1) 3(1) 4(1) 5(1) 6(1) 7(1) 8(1) 9(1)10(11) 1(1)12(1)13(1)14(1)15(1)
2
commutator
E
0(1) 1(1) 2(1) 3(1) 4(1) 5(1) 6(1) 7(1) 8(1) 9(1) 10(11) 1(1)12(1)13(1)14(1) 15(1)
F
0(1) 1(1) 2(1) 3(1) 4(1) 5(1) 6(1) 7(1) 8(1) 9(1)10(11) 1(1)12(1)13(1)14(1)15(1)
G
0(1) 1(1) 2(1) 3(1) 4(1) 5(1) 6(1) 7(1) 8(1) 9(1)10(11) 1(1)12(1)13(1)14(1)15(1)
Radix 4
butterfly
output
data 0(2) 1(2) 2(2) 3(2) 4(2)5(2) 6(2) 7(2)8(2) 9(2)10(21) 1(21) 2(21) 3(21) 4(21) 5(2)
Figure 4.11 Timing diagram for 16-point R4SDC
4.2 Memory-Based Architecture
Memory based architectures, unlike pipelined architectures, generally use only
one butterfly processor. Thus, their advantage is to save hardware cost. Because a
single butterfly processing element can complete only one butterfly operation per
clock cycle, it implies a loss in overall processing speed. We can compensate for this
disadvantage by using the high radix algorithm [17]. Furthermore, we can increase the
system clock rate to achieve the required speed.
Besides the butterfly processor, a memory based architecture as shown in Figure
4.12 also includes main memory, an address generator, and commutators. The
-33-
butterfly input and output are connected to main memory. The function of the address
generator is to control the read/write addresses for memory access. The commutator
arranges the data after reading from and writing back to the memory so as to generate
the correct order for FFT computation.
commutator shared commutator

memory
partition 1
partition 2
partition r
radix r butterfly PE
Figure 4.12 Schematic diagram of memory architecture
4.2.1 Memory–Based Processing Method

(1) Single memory-based architecture
The simplest memory-based architecture is the single memory-based architecture
[18]. It has only one memory, as shown in Figure 4.13. Hence it can be used in place
of a memory addressing strategy that stores the new FFT operation data in the same
location used by the previous FFT operation data.
processor memory
Figure 4.13 Single memory-based architecture block diagram
-34-
(2) Dual memory-based architecture
Dual memory-based architecture [11] performs “ping-pong-like” actions to
accomplish read-and-then-write operations. Each of the two memories alternates their
functions between butterfly inputs and butterfly outputs. This means that it can read
out input data from the memory and write the output results back to the memory
simultaneously. Figure 4.14 is the schematic diagram of dual memory-based
architecture.
memory 1 processo memory 2

r
Figure 4.14 Dual memory-based architecture block diagram
(3) Cache memory-based architecture

Cache memory-based architecture shown in Figure 4.15 consists of one
processor, a small cache memory and a main memory. It is similar to the single
memory architecture shown in Figure 4.15, but with a cache memory. The purpose of
this architecture is to reduce the number of main memory accesses and to reduce
power consumption even if the FFT length is large [11][19].
processor cache memory
Figure 4.15 Cache memory-based architecture block diagram
-35-
4.2.2 Radix-2 Memory-Based Architecture
Reference [20] describes memory-based FFT implementations based on one
radix-2 butterfly structure, as shown in Figure 4.16. It uses an in-place memory
addressing scheme and a memory bank structure in order to increase the efficiency of
memory access. In addition, [20] proposes using an address generator only for the
fixed radix-2 algorithm. It partitions the main memory into two N/2 location banks.
Here, we introduce the in-place memory addressing scheme that stores the
butterfly output data, overwriting the memory locations being simultaneously read.
This scheme provides the N-data points stored in an r-bank memory without conflicts
in memory read/write access. The algorithm in [20] is as follows:
Data _ count  [dn1, dn2 , ......... , d2 , d1 , d0 ]r

n  log r N 
(4.1)
Bank _ index  (dn1  d n2  d2  d1  d0 ) mod r
Address _ index  [dn1, dn2 , .......... , d2 , d1 ]r
In Equation (4.1), Data_count represents the original data index. Bank_index
expresses the appropriate value of bank after memory partition. Address_index is the
new address location in the assigned Bank_index.
-36-
commutator shared commutator
memory
bank0
bank1
radix 2
butterfly
Figure 4.16 Radix-2 memory-based FFT processor
4.2.3 Radix-4 memory-Based Architecture
The architecture proposed in [18] uses a single memory. To achieve high speed,
it uses the radix-4 algorithm, as illustrated in Figure 4.17. It applies the in-place
memory addressing in Section 4.2.2, but modifies the address generation unit and
control unit.
commutator shared memory commutator
bank3
bank2
bank1
bank0
radix 4
butterfly
Figure 4.17 Radix-4 memory-based FFT processor
-37-
4.3 Comparison of Different Architectures
In this subsection, we summarize different architectures with different hardware
requirements.
Table 4.2 Comparison hardware of different pipelined FFT processor
No. of complex No. of complex

multipliers adders throughput No. of registers
R2SDF log2 N  2 2 log2 N TFFT  Tr , PE .N N-1
R4SDF log4 N 1 8 log4 N TFFT  Tr , PE .N N-1
N
R4MDC 3 log N  3 8 log N T T . (5/2)N-4
4 4 FFT r ,PE
r
log4 N
6N
R4SDC log4 N 1 3 log4 N TFFT  Tr ,PE .N  4 2(N1)
i
i1
Table 4.3 Comparison utilization of different pipelined FFT processor [13]
utilization of utilization of
architecture multipliers utilization of registers
adders
R2SDF 50 % 50 % 100 %
R4SDF 25 % 25 % 100 %
R4MDC 25 % 25 % 25 %
R4SDC 75 % 100 % 100 %
Table 4.4 Comparison hardware of different memory based FFT processor
No. of complex No. of computation

architecture multipliers cycles
radix-2 memory N
based architecture 1 log 2N  2
2
radix-4 memory
N
based architecture 3 log N
4
4
-38-

FFT Imp Butterfly

Uploaded by

Copyright:

Available Formats

FFT Imp Butterfly

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

FFT Imp Butterfly

Uploaded by

Copyright:

Available Formats

Chapter 4

Classification of FFT Processor

architecture and shared memory-based architecture. We also compare different

architectures with respect to hardware requirement cost and throughput.

4.1 Pipeline Architecture

subsequent calculations more conveniently.

Pipeline architecture is widely used to attain high-speed operation. The problem

elements. Pipeline architecture can be roughly classified into three types:

 single path delay feedback (SDF) architecture,

 multiple path delay commutator (MDC) architecture and

 single path delay commutator (SDC) architecture.

will describe these properties in the following subsections.

4.1.1 Single Path Delay Feedback (SDF) Architecture

complex multiplier is required.

(1) Radix-2 single delay feedback (R2SDF)

procedural flow of this architecture is as follows.

register (D8 in Figure 4.2). The radix-2 butterfly elements operate on

the last output data of the first stage.

radix-2 butterfy A radix-2 butterfy B radix-2 butterfy C radix-2 butterfy

Figure 4.3 Routing of data for R2SDF (N=16)

(2) Radix-4 single delay feedback (R4SDF)

shift register shift register

Figure 4.5 Timing diagram for R4SDF (N=16)

4.1.2 Multiple Path Delay Commutator (MDC) Architecture

the single path delay feedback architecture.

commutator architecture. The architecture of a 16-point Radix-4 multiple path delay

commutator architecture (R4MDC) is as shown in Figure 4.6. The data routing

diagram is depicted in Figure 4.7. The workflow for this architecture is

(b) These data are passed through different delay registers.

processing elements (BF4) and then multiplied by a twiddle factor.

(d) The result data are stored in the delay register.

by 3 delays. The second parallel data stream must be shifted by 2 delays.

Figure 4.6 A 16-point radix-4 multiple path delay commutator architecture

Figure 4.7 Timing diagram for R4MDC (N=16)

single path delay commutator architecture. The architecture of a 16-point radix-4

single path delay commutator (R4SDC) is as shown in Figure 4.8.

Figure 4.8 A 16-point radix-4 single path delay commutator architecture

Figure 4.9 Delay commutator for R4SDC at each stage

data0 data12 data4 data12 data8 data12 data12 data12

Figure 4.10 Action of switch in Figure 4.9 for 16-point FFT

Table 4.1 Control signals at each stage in Figure 4.9

1 1 1 0 0 Figure 4.10 (a)

0 1 1 0 1 Figure 4.10 (b)

0 0 1 1 0 Figure 4.10 (c)

0 0 0 1 1 Figure 4.10 (d)

0(1) 1(1) 2(1) 3(1) 4(1) 5(1)6(1) 7Radix

Figure 4.11 Timing diagram for 16-point R4SDC

4.2 Memory-Based Architecture

Memory based architectures, unlike pipelined architectures, generally use only

system clock rate to achieve the required speed.

Besides the butterfly processor, a memory based architecture as shown in Figure

the correct order for FFT computation.

commutator shared commutator

Figure 4.12 Schematic diagram of memory architecture

4.2.1 Memory–Based Processing Method

location used by the previous FFT operation data.

Figure 4.13 Single memory-based architecture block diagram

accomplish read-and-then-write operations. Each of the two memories alternates their

simultaneously. Figure 4.14 is the schematic diagram of dual memory-based