FFTF With Verilog
FFTF With Verilog
net/publication/322447765
CITATIONS READS
0 6,440
1 author:
Brad Shelton
University of Utah
2 PUBLICATIONS 0 CITATIONS
SEE PROFILE
All content following this page was uploaded by Brad Shelton on 12 January 2018.
Euler’s Formula:
Abstract— The ability to digitally process analog waveforms has
made up a significant portion of electrical and electronics 𝑒 𝑗𝑥 = cos(𝑥) + 𝑗𝑠𝑖𝑛(𝑥). (3)
engineering for years. Over the years, many methods have been
developed to process signals. Yet despite the various techniques
available, the first step of synthesis almost always requires some Although Eq. (2) is in a form that is computable for hardware,
form of Fourier analysis. In hardware, the Fast Fourier it requires a massive amount of transistors. The Fourier series
Transform (FFT) uses the least amount of space to transform demonstrates the scale of calculations required to approximate
sampled time data. The FFT uses a method known as Decimation sine and cosine below:
in Time (DIT) to divide large multiple-word additions into smaller
single-word additions. This article will provide an analysis of the ∞ ∞
Verilog code, layout, routed circuits required for an 8-bit FFT. 𝑎0 𝜋𝑛𝑡 𝜋𝑛𝑡
𝑓(𝑡) ≈ ( ) + ∑ 𝑎𝑛 cos ( ) + ∑ 𝑏𝑛 sin ( ), (4)
Examination of these analyses will allow a conclusion to be made 2 𝐿 𝐿
as to which method(s) best implement the FFT in hardware. 𝑛 𝑛
𝐿
𝜋𝑛𝑥
𝑎𝑛 = ∫ 𝑓(𝑥) cos ( ) 𝑑𝑥, 𝑛 ≤ 1, (5)
Index Terms—Analog to Digital Converter, Butterfly Radix-2, −𝐿 𝐿
Complex Plane, Decimation in Time, Discrete Fourier Transform, 𝐿
𝜋𝑛𝑥
Fast Fourier Transform, Read-Only Memory, Shift Register, 𝑏𝑛 = ∫ 𝑓(𝑥) sin ( ) 𝑑𝑥, 𝑛 ≤ 1, (6)
Verilog, Very Large-Scale Integration. −𝐿 𝐿
1 𝐿
𝑎0 = ( ) ∫ 𝑓(𝑥) 𝑑𝑥, (7)
𝐿 −𝐿
I. INTRODUCTION
given that 𝐿 is half the period of 𝑓(𝑥) and −𝐿 ≤ 𝑥 ≤ 𝐿. Not
O VER the past few decades, the FFT has transformed the
industry of signal processing by evolving communication
over satellite networks, acoustic engineering, and digital filters.
only must the equation itself be computed, the coefficients have
to be calculated as well. To circumvent this problem, a much
Many voice-activated products, RF controllers, and audio simpler algorithm was developed i.e. the split-radix FFT. In
synthesizers rely on the ability to convert time data to the 1965, J. W. Cooley and John Tukey dramatically reduced the
frequency domain. The FFT is derived from the Discrete amount of calculations of the original Fourier Transform –
Fourier Transform (DFT). The DFT takes a sequence of digital known as the Cooley-Tukey FFT [2]. This method uses a
samples (bits) and converts them to the frequency domain for Decimation in Time (DIT) method that divides and calculates
further analysis. Shown below is the equation for the DFT: the even and odd-indexed bits separately. This DIT is done
recursively until there is only one even or odd index of the data
𝑁−1 added to another even or odd index [3]. The DIT equation is
𝑋𝑘 [𝑛] = ∑ 𝑥𝑛 𝑒 −𝑗2𝜋𝑘𝑛 (1) shown below:
𝑛
𝑁 𝑁
−1 𝑗2𝜋𝑘𝑛 −1 𝑗2𝜋𝑘𝑛
2 − 2 −
where 𝑋𝑘 [𝑛] is the resultant data from the transformed complex 𝑁
−
𝑗2𝜋𝑘𝑛 𝑁
𝑋𝑘 [𝑛] = ∑ 𝑥2𝑛 𝑒 2 +𝑒 𝑁 ∗ ∑ 𝑥2𝑛+1 𝑒 2 . (8)
input 𝑥𝑛 function, n represents discrete time, and 𝑘 is the
𝑛 𝑛
sampling frequency [1]. This equation is also represented in its
computable form as follows: This ‘divide and conquer’ algorithm reduces the number of
𝑁
𝑁−1 complex multiplications from (𝑁 − 1)2 to ( ) 𝑙𝑜𝑔2 𝑁 and the
2𝜋𝑘𝑛 2𝜋𝑘𝑛 2
𝑋𝑘 [𝑛] = ∑ 𝑥𝑛 [cos ( ) − 𝑗 sin ( )] (2) number of complex additions from (𝑁 2 − 𝑁) to 𝑁𝑙𝑜𝑔2 𝑁 – 𝑁
𝑁 𝑁
𝑛 is the number of bits. As seen above, this algorithm also adds a
multiplication to the odd index of the DFT equation: e^(-
where the DFT’s complex coefficients are more explicit using
since acoustical engineering was unavailable at the time. He engineering firms when necessary. Brad has had several
sings, plays guitar and piano, and writes and records many internships prior to running his own business. He worked as a
different genres of music in his recording studio. His main systems engineering Co-Op at L3 Technologies, an acoustic
interest in electrical engineering is guitar amplifier design and engineer for Constantine Technologies, LLC, a
the digital signal processing that is used in sound engine DAWs network/systems intern for the Center for High-Performance
such as Logic Pro or FL Studio. Computing, and an electrical engineering intern at Blackrock
Brad currently works for himself as the owner of WirelessiM Microsystems. In all of these internships he has gained and
LLC – an electrical engineering company specializing in analog refined his software/hardware design, test, and debugging skills
and digital audio design. He also does various contract work for at an intermediate level.
APPENDIX A
DOCUMENTATION
I. COMPLEXITY
As mentioned previously, since I originally only thought to output the magnitude of frequency data to a display using parallel
ADC input, the scope was limited and simple. However, by adding a multiplexer that selects a SIPO or PIPO input, the core can
be used with several different ADCs to implement a real-time display or slightly delayed output to the display using eight or 16-
bit shift registers. Also, by adding the second output freq_data, the output can now go to a DAC, which increases the complexity
further if a PISO or PIPO shift register were used on the output side as well. The square root function could be more complex.
For sake of time to turn this in I couldn’t finish that function, but if used with a lookup table, the data wouldn’t require too much
ROM space on the chip and would provide more accurate results. I intend to finish this function by fabrication time. Other than
that, the paper above already mentioned the complexities of the FFT.
At first, I had wanted to make my own ADC on chip, but after hearing the extensive amount of time that would take for a group
of four, let alone one person, I decided to pass. This was a VERY GOOD IDEA. ADCs are inexpensive and come in many forms
from other manufacturers. The next thought I had was to connect to an off-chip ADC providing serial input data. This idea was
useful, but I didn’t like having to wait and batch my signals with our 10MHz clock to get the data. To be honest I’m not sure if I
would visibly notice a difference whether the clock was 10MHz or 100MHz with my output magnitude data, but I figured parallel
data would be insanely faster. This is when I researched ADCs and found that they have some with parallel outputs. Because this
bypassed my clocked shift register I worried whether or not this was a good idea. My flip-flop wasn’t ever used this way. I didn’t
think you would approve of that, so I added a batch shift register that would capture serial data from ADCs of that kind as well as
a synchronizer circuit, which I haven’t implemented yet, that synchronizes the parallel ADC input to my FFT core in case the ADC
5
clock is way too fast and shoves more bits than my core can compute at one time on the wire. I don’t expect this to be an issue
One of the best decisions I made was to reduce the complexity of the additions by completely removing the twiddle factors. The
math doesn’t actually require them at the hardware level. I’ll demonstrate this below in Fig. 1:
From the Radix-2 above you can see that after the first two stages of addition you need to multiply some of the additions by
The entire addition of the butterfly is explicitly calculated below where the hardware Verilog code is on the left and the actual
real world theoretical sum is on the right. If the actual sum is blank, this means that the hardware sum is exactly the same as the
actual sum. Also, if a sum is highlighted in red, that means hardware didn’t have to compute it due to cancellation of terms or
substitution tricks:
U0 <= T0 + T4; // U0 = T0 + T4 = S0 + S2 + S4 + S6 = adc[0] + adc[4] + adc[2] + adc[6] + adc[1] + adc[5] + adc[3] + adc[7]
(adc[3] + adc[7])*-1
U4 <= T0 + T4; // U4 = T0 + (-1*T4) = S0 + S2 + (S4 + S6)*-1 = adc[0] + adc[4] + adc[2] + adc[6] + (adc[1] + adc[5] + adc[3]
+ adc[7])*-1
U6 <= T2 + T6; // U6 = T2 + (-1*T6) = S0 + (-1*S2) + (S4 + (-1*S6))*-1 = adc[0] + adc[4] + (adc[2] + adc[6])*-1 + (adc[1] +
You can see that there is a terrible amount of math involved even for eight bits. But, as shown below in Figs. 2-5, there is a great
deal of substituting and cancelling due to double-negatives and imaginary numbers getting squared:
From the figures above, you may have been able to notice that double negatives cancel, j^2 causes subtractions instead of additions,
and that the 0.707 portion of the twiddle factor is squared in several places and then doubled which leaves a coefficient of
approximately 1. Then you can combine portions of the sums that are real valued separate from the sums that have an imaginary
r0 <= T0 + T4;
i0 <= 0;
r1 <= S1 - S7;
r2 <= T2 - T6;
i2 <= 0;
r3 <= S1 - S7;
i3 <= S3 + S5;
r4 <= T0 - T4;
i4 <= 0;
r5 <= S1 + S7;
r6 <= T2 + T6;
i6 <= 0;
r7 <= S1 + S7;
i7 <= S3 - S5;
It is interesting to note that due to cancellation, some of the T sums aren’t even used and many of imaginary bit indices end of
summing to zero.
The next idea I had was to add another output freq_data so that my FFT was general purpose. This was probably the best decision
I made so the chip can be coupled with several other external circuits. Also, I decided to make the square root function a normalizing
function instead of the actual function. I don’t like this decision, but in order to see results from my simulations I had to do it. I
plan to fix this before sending my core to MOSIS. The next thing I decided was to make my Verilog scalable. This was the second
most important decision because I could get better precision and handle more data in my core at one time depending how big my
core got with each scaling of 2 (8, 16, 32-bits, etc.). Right now, with 8-bits my code circuit is extremely small. I can make it bigger
for sure due to how I pre-calculated the butterfly radix-2 stages. I also think it would be really cool to put a super simple reverb
12
loop on my input ADC data just to see what it sounds like. I also didn’t realize that I could have reg signed input or reg signed
output which made simulation terrible until I realized how to properly code that syntax. That was a lifesaver.
To this point – taking into account that it’s just me and I’ve come this far since we met two days ago – I have tested the 8-bit
PIPO FFT using NC-Verilog and gotten accurate results. This code forms a perfectly normal fft_struct.v file in Synopsys. The 8-
bit SIPO is also tested and is working, but my counter isn’t working for whatever reason. It looks exactly like the other groups’
FFT code. Since the SIPO is just a shift register, the radix-2 still works without a counter and just computes the FFT every clock
cycle instead of every eight cycles – pushing the MSB off every positive edge of the clock.
I’ve been able to open Innovus with a generated Verilog structural file and it places my nets and my butterfly addition. This is
how I knew my circuit was able to be scalable to a higher bit resolution. Unfortunately, I haven’t been able to test my ROM,
which is a big deal I know, but I don’t think it will be too difficult. Otherwise over the break all I need to do is get my pad rings
and drive strengths connected once I get my ROM synthesized properly and then we’re in the clear for fabrication.
IV. TESTING
To simulate my Verilog code, I started out really simple. I just wanted my two-input radix-2 to work. It simulated with flying
colors. Then I went on to the 8-bit FFT. The code and simulation is shown below with no problems:
always @ (*)
begin
S0 <= adc[0] + adc[4];
S1 <= adc[0] - adc[4];
S2 <= adc[2] + adc[6];
S3 <= adc[6] - adc[2];
S4 <= adc[1] + adc[5];
S5 <= adc[1] - adc[5];
S6 <= adc[3] + adc[7];
S7 <= adc[7] - adc[3];
T0 <= S0 + S2;
T2 <= S0 - S2;
T4 <= S4 + S6;
T6 <= S6 - S4;
r0 <= T0 + T4;
13
i0 <= 0;
r1 <= S1 - S7;
i1 <= -S3 - S5;
r2 <= T2 - T6;
i2 <= 0;
r3 <= S1 - S7;
i3 <= S3 + S5;
r4 <= T0 - T4;
i4 <= 0;
r5 <= S1 + S7;
i5 <= -S3 + S5;
r6 <= T2 + T6;
i6 <= 0;
r7 <= S1 + S7;
i7 <= S3 - S5;
end
endmodule
********TESTBENCH**********
initial
begin
adc[7:0] = 8'b00000000;
end
14
I also simulated the SIPO shift register and got functional results – which will be fixed shortly – using this Verilog test bench
code:
always @ (a0)
begin
--DO ALL THE MATH --
end
endmodule
********TESTBENCH**********
initial
begin
adc0 = 1'b0;
clk = 1'b0;
clr = 1'b1;
#2
clr = 1'b0;
end
always #2 clk=~clk;
always #4 adc0 = ~adc0;
The output display_data keeps a consistent value because my input (adc) is giving it the same value once all eight slots of the
batch shift register are filled. Next shown is the symbol schematic for my circuit:
Thus, you can see the eight parallel ADC inputs, the one serial input, and the two 15-bit outputs display_data and freq_data. I
will test this schematic in Spectre this weekend for analog verification. To verify my chip before it is fabricated, I will use the
Xilinx FPGA board with specified fake ADC values and light up the 8 LEDs on the board in correspondence with the magnitude
amplitudes. To actually test the chip when it is fabricated, I intend to buy an 8x8 bit LED display, with my guitar connected to an
APPENDIX B
LIBRARY DOCUMENTATION
AOI2X1: Fig. 9. AOI2X2 Schematic.
BUF4X: BUF8X:
DFFX1:
Fig. 17. INVX1 Schematic. Figs. 19-21. FILL1, FILL4, FILL8 Layout.
20
INVX4: INVX8:
MUX2X1: NAND2X1:
NANDX2: NAND2X4:
NAND3X1: NOR2X1:
NOR2X4: NOR3X1:
NORX2: OAI2X1:
OAI2X2: TIEHI:
TIEHI4X: TIELO:
TIELO4X: XOR2X1:
Fig. 59. Enabled Inverter Layout. Fig. 61. Transmission gate Layout.
30
Cell Descriptions: 0 0 1
0 1 1
General 1 0 1
o The D Flip Flop (DFFX1) is rising edge triggered, 1 1 0
master slave flip flop with asynchronous active-low
clear o NAND2X4
Inputs Output
o Any cell that has X2, X4, X8, or 4X in its name has a A B F
0 0 1
drive strength multiple of the X1 version i.e. INV8 has
0 1 1
eight times the logical drive strength than that of the 1 0 1
INVX1. The areas also are generally double. 1 1 0
o BUF4X o NOR2X4
Input Output Inputs Output
In Out A B F
0 0 0 0 1
1 1 0 1 0
1 0 0
o BUF8X 1 1 1
Input output
In Out o NOR2X8
0 0 Inputs Output
1 1 A B F
0 0 1
o NANDX2 0 1 0
Inputs Output 1 0 0
A B F 1 1 1
0 0 1
0 1 1 o TIEHI
1 0 1 Output
1 1 0 F
1
o NAND2X1
Inputs Output o TIEHI4X
A B F Output
0 0 1 F
0 1 1 1
1 0 1
1 1 0 o TIELO
Output
o NAND2X1 F
Inputs Output 1
A B F
31
o TIELO4X o OAI2X2
Output Inputs Output
F A B C F
1 0 0 0 1
0 0 1 1
o NAND3X1 0 1 0 1
Inputs Output 0 1 1 0
A B C F 1 0 0 0
0 0 0 1 1 0 1 0
0 0 1 1 1 1 0 0
0 1 0 1 1 1 1 0
0 1 1 1
1 0 0 1 o MUX2X1
1 0 1 1 Inputs Output
1 1 0 1 S P0 P1 F
1 1 1 0 0 0 0 0
o NOR3X1 0 0 1 1
Inputs Output 1 1 0 1
A B C F 1 1 1 1
0 0 0 1
0 0 1 0 o XOR2X1
0 1 0 0 Inputs Output
0 1 1 1 A B F
1 0 0 0 0 0 0
1 0 1 1 0 1 1
1 1 0 1 1 0 1
1 1 1 1 1 1 0
o OAI2X1
Inputs Output
A B C F
0 0 0 1
0 0 1 1
0 1 0 1
0 1 1 0
1 0 0 0
1 0 1 0
1 1 0 0
1 1 1 0