FFT Imp Butterfly
FFT Imp Butterfly
FFT Imp Butterfly
There are numerous architectures for realizing FFT computations in real time
applications. In this chapter, we will only discuss two main classes: pipeline
Pipeline architectures utilize parallel processing among the stages. The general
structures of these architectures [12] are as shown in Figure 4.1. These structures
consist of one butterfly element between commutators at each stage. The function of
butterfly element is to compute the data addition and subtraction. The commutator is
like a switch to rearrange the data from the butterfly element in order to perform
with this architecture is the relatively large die area required by the logNr butterfly
These types have different properties, each with advantages and disadvantages. We
-25-
commutato
commutato
commutato
radix-r radix-r
butterfly radix-r butterfly butterfly
processing processing processing
element element (BF_PE) element
(BF_PE) (BF_PE)
r
Figure 4.1 Pipeline architecture
In the general single path delay feedback architecture, the input data sequences
pass through one single path. The butterfly processing element performs the
computations on the data. This structure uses shift registers or RAM to store the
output data from the butterfly elements. As a result of the single output path, only one
Here, we give two examples to illustrate the principle of single path delay
feedback architecture. One example is a 16-point R2SDF and the other is a 16-point
R4SDF.
(a) In the beginning, the input data with indices 0 to 7 are stored in the shift
these data and on the remaining input data with indices 8 to 15.
(b) The data resulting from the butterfly addition operations are passed to
the second stage and the subtraction results are fed back to the shift
register (D8).
-26-
(c) After all the 8-point data addition results are output, the subtraction data
from the registers are passed to the second stage with a multiply twiddle
factor coefficient. The symbol A in Figure 4.2 and Figure 4.3 represents
(d) The procedure flow of the latter three stages for the 16-point R2SDF are
similar to the first stage. The routing of data for R2SDF is given in
Figure 4.3.
D8 D4 D2 D1
Figure 4.2 A 16- point radix-2 single path delay feedback architecture
input
data 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
output A data 0(1) 1(1) 2(1) 3(1) 4(1) 5(1) 6(1) 7(1) 8(1) 9(1) 10(11 1 12(1)13(1)14(1)15(1)
) (1)
output B data 0(2) 1(2) 2(2) 3(2) 4(2) 5(2) 6(2) 7(2) 8(2) 9(2) 10( 21) 1(21) 2(21) 3(21) 4(21) 5 (2)
output C data 0(3) 1(3) 2(3) 3(3) 4(3) 5(3) 6(3) 7(3) 8(3) 9(3) 10( 31) 1(31 2(3)13(31) 4(3)15(3)
final output data 0(4) 1(4) 2(4) 3(4) 4(4) 5(4) 6(4) 7(4) 8(4) 9(4) 10(41) 1(41) 2( 41) 3(41) 4( 41) 5 (4)
butterfly operation is based on the radix-4 algorithm. The procedural flow of this
architecture [25] is identical to that of R2SDF. Figure 4.4 is a block diagram of the
radix-4 single delay feedback architecture. The schematic timing diagram is given in
-27-
Figure 4.5. The data output A in Figure 4.5 represents the data resulting from the 16-
point radix-4 butterfly computation of the first stage. The output data in Figure 4.5
represents the data resulting from butterfly computation in the second stage.
commutato
commutato
commutato
output
twiddle factor coefficient data
D D D D D
D D D D D
r
r
where D= delay
radix4
radix
BF
BF
4
Figure 4.4 A 16- point radix-4 single path delay feedback architecture
input
data 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
data of output A 0(1) 1(1) 2(1) 3(1) 4(1) 5(1) 6(1) 7(1) 8(1) 9(1) 10(11
) (1)
1 12(1)13(1)14(1) 15(1)
output
data 0(2) 1(2) 2(2) 3(2) 4(2) 5(2) 6(2) 7(2) 8(2) 9(2) 10(21) 1(21) 2(21) 3(21) 4(21) 5(2)
In the general multiple path delay feedback architecture [14], the input data
sequences are organized into a parallel data path entering the commutator by suitable
delays and data summation is performed by the butterfly element at each stage. The
intermediate data is not fed back to register. It has a higher throughput data rate than
-28-
Here, we give an example to illustrate the principles of the multiple path delay
(a) In the beginning, the input data from index 0 to 15 is decomposed into
four parallel data streams of four points each after the first commutator.
(c) After the delay, the resulting data are operated on by radix-4 butterfly
(e) The function of the second commutator is to reorder the desired adjacent
data.
(f) Before the second butterfly operation, the ordered data are stored in an
appropriate delay register. The first parallel data stream must be shifted
The third parallel data stream must be shifted by 1 delay and the last
data stream is directly passed to the butterfly input in the second stage.
input
data D D
12 3
twiddle factor
D coefficient D output
commuta 8 D commut 2 data
tor 1 BF 1 BF4
4 a
twiddle factor tor 2
D coefficient D
4 D 1
2
twiddle factor
coefficient
D
3
-29-
input
data 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
commutator
1
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
outpu delay 12
t delay 8
dela
y delay 4
0 1 2 3
radix 4 4 5 6 7
butterfl
y data 8 9 10 11
12 13 14 15
delay 1
input
dela delay 2
y delay 3
0 1 2 3
4 5 6 7
8 9 13
10 11
12 14 15
commutator
2
0 4 8 12
5 9
1 13
2 6 10 14
3 7 11 15
outpu delay 3
t delay 2
dela
y delay 1
0 4 8 12
1 5 9 13
radix 4
butterfl 2 6 10 14
y data
3 7 11 15
-30-
4.1.3 Single Path Delay Commutator (SDC) Architecture
The single path delay commutator (SDC) architecture [15][16] is based on the
modified multiple path delay commutator. In each word cycle, each stage produces a
single output rather than 4 outputs as in the multiple path delay commutator. Each
stage requires one complex multiplier, a delay commutator to correct the order of the
data, and a butterfly element. Here, we give an example to illustrate the principle of
output
twiddle data
input commu butterfl factor commu butterfl
data tator 1 y tator 2 y
elemen elemen
t t
c1 c2 c3 c1c2 c3
Figure 4.9 is the commutator for the R4SDC at each stage. There are six shift
registers. The symbol Nt represents the number of delays for each shift register. The
multiplexer control parameters c1,c2, and c3 select the required data. The parameters
m1 and m2 are switch control signals to reorder data. Its function is graphically charted
in Figure 4.10 and Table 4.1. The timing diagram for the 16-point R4SDC is described
in Figure 4.11.
-31-
A B C D
Nt Nt Nt
1
input data E x[n+ 3N /4]
0
Nt
x[n+ 2N /4]
c1
1 sw itch
F
0
Nt
x[n+ N /4]
c2
1
G
0
Nt x[n]
c3
m1 m 2
c1 c2 c3 m1 m2 Figure
-32-
input
data
A
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
B
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
C
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1
commutator
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
E
F 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
G
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
E
0(1) 1(1) 2(1) 3(1) 4(1) 5(1) 6(1) 7(1) 8(1) 9(1) 10(11) 1(1)12(1)13(1)14(1) 15(1)
F
0(1) 1(1) 2(1) 3(1) 4(1) 5(1) 6(1) 7(1) 8(1) 9(1)10(11) 1(1)12(1)13(1)14(1)15(1)
G
0(1) 1(1) 2(1) 3(1) 4(1) 5(1) 6(1) 7(1) 8(1) 9(1)10(11) 1(1)12(1)13(1)14(1)15(1)
Radix 4
butterfly
output
data 0(2) 1(2) 2(2) 3(2) 4(2)5(2) 6(2) 7(2)8(2) 9(2)10(21) 1(21) 2(21) 3(21) 4(21) 5(2)
one butterfly processor. Thus, their advantage is to save hardware cost. Because a
single butterfly processing element can complete only one butterfly operation per
clock cycle, it implies a loss in overall processing speed. We can compensate for this
disadvantage by using the high radix algorithm [17]. Furthermore, we can increase the
4.12 also includes main memory, an address generator, and commutators. The
-33-
butterfly input and output are connected to main memory. The function of the address
generator is to control the read/write addresses for memory access. The commutator
arranges the data after reading from and writing back to the memory so as to generate
partition 1
partition 2
partition r
radix r butterfly PE
[18]. It has only one memory, as shown in Figure 4.13. Hence it can be used in place
of a memory addressing strategy that stores the new FFT operation data in the same
processor memory
-34-
(2) Dual memory-based architecture
Dual memory-based architecture [11] performs “ping-pong-like” actions to
functions between butterfly inputs and butterfly outputs. This means that it can read
out input data from the memory and write the output results back to the memory
architecture.
processor, a small cache memory and a main memory. It is similar to the single
memory architecture shown in Figure 4.15, but with a cache memory. The purpose of
this architecture is to reduce the number of main memory accesses and to reduce
-35-
4.2.2 Radix-2 Memory-Based Architecture
addressing scheme and a memory bank structure in order to increase the efficiency of
memory access. In addition, [20] proposes using an address generator only for the
fixed radix-2 algorithm. It partitions the main memory into two N/2 location banks.
Here, we introduce the in-place memory addressing scheme that stores the
butterfly output data, overwriting the memory locations being simultaneously read.
This scheme provides the N-data points stored in an r-bank memory without conflicts
expresses the appropriate value of bank after memory partition. Address_index is the
-36-
commutator shared commutator
memory
bank0
bank1
radix 2
butterfly
The architecture proposed in [18] uses a single memory. To achieve high speed,
it uses the radix-4 algorithm, as illustrated in Figure 4.17. It applies the in-place
memory addressing in Section 4.2.2, but modifies the address generation unit and
control unit.
bank3
bank2
bank1
bank0
radix 4
butterfly
-37-
4.3 Comparison of Different Architectures
requirements.
N
R4MDC 3 log N 3 8 log N T T . (5/2)N-4
4 4 FFT r ,PE
r
log4 N
6N
R4SDC log4 N 1 3 log4 N TFFT Tr ,PE .N 4 2(N1)
i
i1
utilization of utilization of
architecture multipliers utilization of registers
adders
R2SDF 50 % 50 % 100 %
R4SDF 25 % 25 % 100 %
R4MDC 25 % 25 % 25 %
radix-2 memory N
based architecture 1 log 2N 2
2
radix-4 memory
N
based architecture 3 log N
4
4
-38-