Course Slides 2018
Reza Sameni
Emory University
All content following this page was uploaded by Reza Sameni on 05 June 2018.
Some images and source codes (cited within the text) have been
adopted from books, papers, datasheets, and the World Wide
Web; but may be subject to copyright.
An ideal VN Pipeline
11
• Rapid prototyping
reduced time-to-market
• In-system customization
hardware updates and patches
• Remote reconfiguration
via RF links for telecommunication BTS, spacecrafts, satellites,…
• Multi-modal computation
Environment aware hardware
• Adaptive computing systems
Machine learning applications
15
References
PLD Technologies
Programmable Logic Devices (PLD) have a long history
(longer than conventional VN architecture CPUs):
• PROM
• Logic Chips
• SPLD: PLA & PAL
• CPLD
• FPGA
• ASIC
18
• CMOS Technology:
• The 4xxx-series
21
7400 Sub-series
• 74LS74: Low-power Schottky
• 74HC74: High-speed CMOS
• 74HCT74: High-speed CMOS with 74LS TTL-compatible inputs
• SN74F00: Fast logic
22
Reference: http://en.wikipedia.org/wiki/7400_series
24
f ( x, y, z) xy xyz
Technologies:
• Simple PLD (SPLD)
• Complex PLD (CPLD)
25
Example
27
Macrocell
Programmable Switches
PLD Programming
JTAG
JTAG is a serial interface technology. The connector pins are:
• TDI: Test Data In
• TDO: Test Data Out
• TCK: Test Clock
• TMS: Test Mode Select
• TRST: Test Reset (optional)
Ref: https://www.xjtag.com/about-jtag/jtag-a-technical-overview
44
PLD Packages
• Plastic Leaded Chip Carrier (PLCC)
• Small Outline Integrated Circuit (SOIC)
• Plastic Small Outline Package (PSOP)
• Thin Small Outline Package (TSOP)
• Pin Grid Array (PGA)
• Ball Grid Array (BGA)
Xilinx®
http://www.xilinx.com/
54
Altera®
http://www.altera.com/
Actel®
http://www.actel.com/
56
Lattice®
http://www.latticesemi.com/
57
QuickLogic®
http://www.quicklogic.com/
58
Reference: http://www.eetimes.com/
59
Reference: www.xilinx.com
60
Reference: www.xilinx.com
61
Reference: https://www.gminsights.com/industry-analysis/field-programmable-gate-array-fpga-market-size
62
References:
• S. Brown and Z. Vranesic, Fundamentals of Digital Logic
with Verilog Design, McGraw-Hill, 2003, Chapter 3
• B. Zeidman, Designing with FPGAs & CPLDs, CMP Books,
2002, Chapter 1
FPGA INTERNAL
ARCHITECTURE
64
From top to bottom the logic blocks become more complex and
advanced.
Node-Based Reconfigurable Architectures: Imagine a network
of computers and programmable devices, which can be
reconfigured on-demand
Current FPGA architectures are considered medium grain in this
classification
66
The left-hand slices of a CLB (SLICEM) can be configured as combinatorial logic, as 16-bit SRAM, or as shift registers, while the right-hand slices (SLICEL) can only be configured as combinatorial logic.
73
Slicing
74
Why?
75
Interconnect Networks
79
Interconnect Networks
80
Clock Trees
84
Clock Management
Usage:
1. Jitter removal
2. Frequency synthesis
3. Phase shifting
4. Clock de-skewing
85
1. Jitter Removal
86
Jitter Specifications
87
Jitter Calculation
90
Jitter Calculation
91
Jitter Calculation
Example 2 (Cascaded DCMs)
Assume that the input clock has 150 ps (±75 ps) of period jitter. Assume that DCM (A) uses
the CLK2X output. Use the Spartan-3 Data Sheet specification called CLKOUT_PER_JITT_2X
for the DCM output jitter, estimated here as 400 ps (±200 ps). Assume that DCM (B) uses the
CLKDV output with an integer divider value. Use the Spartan-3 Data Sheet specification called
CLKOUT_PER_JITT_DV1 for the DCM output jitter, estimated here as 300 ps (±150 ps).
Finally, assume that DCM (C) phase shifts the output from DCM (B) by 90°. Use the Spartan-3
Data Sheet specification called CLKOUT_PER_JITT_90 for the DCM output jitter, estimated
here as 300 ps (±150 ps).
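The total jitter of this cascade can be estimated with a short calculation. The sketch below assumes that the individual jitter sources are independent and that their peak-to-peak values combine as a root-sum-of-squares; that combination rule and the helper name total_jitter_ps are assumptions of this sketch, not statements taken from the slide.

```python
import math

def total_jitter_ps(*peak_to_peak_ps):
    """Combine peak-to-peak jitter terms (in ps) as a root-sum-of-squares.

    Assumption: the jitter sources are statistically independent.
    """
    return math.sqrt(sum(j * j for j in peak_to_peak_ps))

# Input clock, DCM (A) CLK2X, DCM (B) CLKDV, DCM (C) 90-degree phase shift
total = total_jitter_ps(150, 400, 300, 300)
print(f"total period jitter ~ {total:.0f} ps (+/- {total / 2:.0f} ps)")
```

Under this assumption the cascade is dominated by its largest term (the CLK2X stage), so removing a small contributor changes the total very little.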
92
2. Frequency Synthesis
94
3. Phase Shifting
95
4. Clock De-skewing
96
Further Reading:
Basics of DLLs: https://open4tech.com/phase-and-delay-locked-loops-basics
Control Models of PLLs and DLLs: http://pages.hmc.edu/harris/cmosvlsi/4e/lect/lect22.pdf
100
Reference: http://www.xilinx.com/support/documentation/application_notes/xapp462.pdf
101
Reference: http://www.xilinx.com/support/documentation/application_notes/xapp462.pdf
105
DCM Cascading
106
IP Cores
• Hard IP
• In the form of pre-implemented blocks such as microprocessor
cores, gigabit interfaces, multipliers, adders, MAC functions, etc.
Example: Xilinx PowerPC
• Soft IP
• Source-level library of high-level functions that can be integrated in
a custom design.
• Firm IP
• Libraries which have already been optimally mapped, placed, and
routed into a group of programmable logic blocks (and possibly
combined with some hard IP blocks like multipliers, etc.) and may
be integrated into a custom design.
Example: Xilinx MicroBlaze
113
• Question: Why?
Reference: http://www.gartner.com
117
Microprocessor                          FPGA
Architectural design                    Architectural design
Choice of language (C, Java, etc.)      Choice of language (Verilog, VHDL, etc.)
1. Design Entry
Utilities for design entries:
• Schematic Editors
• e.g., Altium’s FPGA-ready Design Components and FPGA Generic
2. Functional Simulation
• Behavioral Simulation; not necessarily implementable on hardware
• Structural Simulation; can simulate bitwise accurate models of the
final hardware
127
3. Logic Synthesis
HDL → Boolean Equations → Technology Mapping
Hardware Description
How can we describe hardware?
1. Schematic design tools: Visual schematic editors. e.g.,
Altium®, Protel®, OrCAD®, Xilinx PlanAhead®, etc.
2. Hardware description languages: Verilog, VHDL, etc.
3. Set of libraries and classes in software languages
4. Any other?
132
Full-Adder
137
Hardware Description
What should an HDL look like and what features should it have?
1. Cover different levels of abstraction: transistor level, gate level,
register transfer level (RTL), system level
2. Applicable for different architectures: CPLD, FPGA, ASIC, etc.
3. Provide a unique description for all synthesizable hardware
4. Ability of accurate simulation before implementation. The language
should be able to simulate other functionalities required for
hardware description and simulation: generating synthetic
waveforms, reading/writing test vectors from/to files, setting time
bases, etc.
5. Convertible into conventional data structures such as trees and
graphs for algorithmic simplifications and optimizations
6. Existence of tools (tool chains) for translating the “hardware
description” into “hardware”
139
Verilog HDL
We use Verilog HDL in this course, because
• It has all the required features of a complete HDL
• It has a rather simple syntax
• It is not as verbose as VHDL
• It is highly popular in industry (for RTL design)
These elements are from different levels of abstraction; but any HDL should
be able to “describe” them.
143
Logic Values
• Verilog supports four logic values
Logic Value    Description/Usage
0              zero, low, or false
1              one, high, or true
z or Z         high-impedance, tristate, floating (dangling)
x or X         unknown, uninitialized, collision
Assignments
• The assign keyword is used to connect wires and to define
single-line combinational logic.
147
Module Definition
[Figure: anatomy of a module definition. Labels: module name, inputs, output, wire, instances, instance name, inout mechanism]
150
Note: Port order is not important when using “by name” mapping
Comments
152
Always Blocks
• An always block is used to define both combinational and sequential logic blocks.
• Registers may only be assigned inside an always block (although they may represent combinational logic).
• Variables assigned in an always block should all be declared as reg.
Flip-flop inferred
sensitivity list
equivalent
No flip-flops inferred!
equivalent
154
Always Blocks
The following two pieces of code are identical (five flip-flops are inferred in total):
We see that the always block has abbreviated the explicit declaration of five flip-flops.
Note: All always procedures with the same sensitivity list are concurrent. They describe parallel flip-flops, which share a common clock.
Note: The order in which wire assignments, always blocks, and their internal assignments are written is irrelevant; timing is managed by data-flow and state controllers, not by the order of the code lines.
Question: What issues can arise when the order of code lines becomes irrelevant?
157
Answer: The Verilog syntax does not allow this (a register may only be assigned
in a single always block). Problem solved!...
vs.
Non-blocking assignment: All assignments that use the variable are deferred until all right-hand sides have been evaluated (end of simulation time-step).
Guideline: Use blocking assignments only for combinational logic description. Use non-blocking assignments for sequential register assignment.
Further Reading: http://courses.csail.mit.edu/6.111/f2007/handouts/L06.pdf
160
Answer: Race condition; no syntactic solutions exist for this issue. Should
be avoided/resolved by proper design.
Example: Passing data between different clock domains.
161
Ref: http://referencedesigner.com/tutorials/verilog/verilog_23.php
166
For-Loops in Verilog
• For-loops in their software-like usage are not synthesizable in Verilog.
• Question: Why?
• In synthesizable Verilog code, for-loops are merely used for writing shorter scripts that generate code.
• We will learn alternative code generation methods in later sections.
167
For-Loops in VHDL
168
• Major Reference: Xilinx XST User Guide, UG627 (v 11.3) September 16, 2009.
URL: https://www.xilinx.com/support/documentation/sw_manuals/xilinx11/xst.pdf
169
Verilog VHDL
alternative form
178
Common Buffers
• Buffers may also be used as built-in primitives.
Gate      Description
not       Output inverter.
buf       Output buffer.
bufif0    Tri-state buffer, active-low enable.
bufif1    Tri-state buffer, active-high enable.
notif0    Tri-state inverter, active-low enable.
notif1    Tri-state inverter, active-high enable.
Example:
bufif0 (weak1, pull0) #(4,5,3) (data_out, data_in, ctrl);
180
Sample applications:
• FIFO valid data counter
• Chirp signal generator
186
Verilog VHDL
195
Verilog VHDL
196
Multiplexers in Verilog
• If-Then-Else or Case can be used for multiplexers (MUXs) description.
• If one describes a MUX using a Case statement, and does not specify
all values of the selector, the result may be latches instead of a
multiplexer. When writing MUXs, one can use don’t care to describe
selector values.
• XST decides whether to infer the MUXs during the Macro Inference
step. If the MUX has several inputs that are the same, XST can decide
not to infer it. One can use the MUX_EXTRACT constraint to force XST
to infer the MUX.
• Verilog Case statements can be: full or not full; parallel or not parallel
• A Verilog Case statement is:
• Full: if all possible branches are specified
• Parallel: if it does not contain branches that can be executed
simultaneously
201
[Side-by-side Verilog and VHDL listings, not recovered, for: multiplexers, one-hot decoders, one-cold decoders, priority encoders, unsigned adder, signed adder, unsigned subtractor, unsigned adder/subtractor, unsigned multiplier]
Note:
Considering that $(a_r + j a_i)(b_r + j b_i) = (a_r b_r - a_i b_i) + j(a_r b_i + a_i b_r)$:
• The first two cycles compute:
Res_real = A_real * B_real - A_imag * B_imag
• The second two cycles compute:
Res_imag = A_real * B_imag + A_imag * B_real
226
Pipelining
• Pipelining is a general technique for improving design timing and hardware
utilization efficiency by using parallel units that simultaneously process the
output of preceding stages of the pipeline.
• Implementing combinational logic using pipelines can significantly reduce the
critical path delay.
228
A Few Definitions
• (Input-Output) Latency: the amount of time it takes to travel through the pipe.
• Critical Path: the longest combinational path from the output of one flip-flop to the input of another flip-flop (sharing a common clock).
• Throughput: the maximum rate of data flowing in or out of the pipe.
Example: Throughput = one task every three days; latency is input-output path dependent.
[Figure: pipelined datapath; visible labels: 3 ns, 8 ns, clock]
New critical path = 5 ns, Max Throughput = 200 MHz, I/O Latency = 4 clocks (20 ns @ f_clock = 200 MHz)
(we will discuss much more about pipelining in digital systems design up to the end of the course)
230
Verilog VHDL
Verilog VHDL
232
Verilog VHDL
Notes:
Dividers are supported only when the
divisor is a constant and is a power of 2.
In that case, the operator is
implemented as a shifter. Otherwise,
XST issues an error message.
IP cores or custom code can be used
for other divisors.
(we will discuss much more about resource sharing in digital systems design up to end of the course)
Mealy Machine:
$s_{k+1} = f(s_k, x_k)$
$y_k = g(s_k, x_k)$
Moore Machine:
$s_{k+1} = f(s_k, x_k)$
$y_k = g(s_k)$
Research Topic: According to the above representation, Mealy and Moore machines
can be studied from a state-space perspective. The rich literature of state-space
analysis from Control Theory can be used to study the properties of logic circuits.
272
Ref: https://www.electronics-tutorials.ws/combination/comb_5.html
274
Ref: https://www.electronics-tutorials.ws/combination/comb_5.html
275
Verilog VHDL
277
Verilog VHDL
Summary
clock
data
286
Debouncing
• In digital designs, bouncing
(between 0 and 1) occurs during
manual switch transitions
• The objective of debouncing is to
avoid the mis-detection or multiple
counting of events during switch
transitions
• Debouncing can be implemented
both in hardware (analog) and
software (digital)
Reference: Arora, M. (2011). The art of hardware
architecture: Design methods and techniques
for digital circuits. Springer Science & Business
Media, Chapter 8
289
FPGA
Ref: https://eewiki.net/pages/viewpage.action?pageId=13599139
OVERVIEW OF LOGIC
SYNTHESIS METHODS*
(Optional)
293
Note 1: XST performs a resource sharing check. This usually leads to a reduction
of the area as well as an increase in the clock frequency.
Note: This is where the term Register Transfer Level (RTL) comes from
298
Node Synthesis
• Two-level Logic Synthesis
• Deals with the synthesis of designs represented in two-level logic. The longest path from input to output, in terms of the number of gates crossed on the path, is two.
• Two-level logic is the natural and straightforward approach to implement a Boolean
function, because each Boolean function can be represented as a sum of product terms.
• In the first level, the products are built using the AND primitives. The sums of the
resulting products are built in the second level with the OR-primitives.
• Used for CPLD
Node Representation
1. Sum of Products (SOP) Form
2. Factored Form
• a product is either a single literal or the product of two factored forms and a sum
is either a single literal or the sum of two factored forms.
• Factored forms are representative of the logic complexity.
• Sign Representation:
• Unsigned
• Signed
Sign-Magnitude Representation
The MSB is reserved for sign representation (0 for + and 1 for –). The remaining bits are used to represent the absolute magnitude. With N bits, it can code from $-(2^{N-1} - 1)$ to $2^{N-1} - 1$.
One’s Complement
The MSB denotes the sign (0 for + and 1 for –). With N bits, it can code from $-(2^{N-1} - 1)$ to $2^{N-1} - 1$. Each bit corresponds to a coefficient of a power of two in its decimal equivalent.
Two’s Complement
The MSB denotes the sign (0 for + and 1 for –). With N bits, it can code from $-2^{N-1}$ to $2^{N-1} - 1$. Each bit corresponds to a coefficient of a power of two in its decimal equivalent.
Advantage: No repeated zeros; can code $-2^{N-1}$; no sign control needed during arithmetic operations, and several other advantages (it is the most popular signed number representation format)
Disadvantage: Slightly more difficult for humans to read the decimal equivalent from the binary form.
313
2’s Complement:
• Method 1: Calculate the 1’s complement, then add one
• Method 2: Subtract the number from $2^N$ (this is where the name 2’s complement comes from)
• Method 3: Starting from the LSB, preserve all the bits as they are, up to (and including) the rightmost 1. Flip all the remaining bits up to the MSB
Note: The 2’s complement of $-2^{N-1}$ cannot be represented in N bits. Therefore, during calculations, its 2’s complement overflows and becomes equal to itself (just like the 2’s complement of zero)! This phenomenon can be mathematically explained by the orbit-stabilizer theorem.
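The three methods can be checked against each other in software. The following is a minimal sketch that operates on N-bit patterns stored as non-negative Python integers; the function name twos_complement is a hypothetical helper chosen for this illustration.

```python
def twos_complement(x, n):
    """Return the 2's complement of the n-bit pattern x, computed three ways."""
    mask = (1 << n) - 1
    # Method 1: 1's complement (bitwise inversion), then add one
    m1 = ((x ^ mask) + 1) & mask
    # Method 2: subtract the number from 2^N (kept to n bits)
    m2 = ((1 << n) - x) & mask
    # Method 3: keep bits up to and including the rightmost 1, flip the rest
    if x == 0:
        m3 = 0
    else:
        lowest_one = x & -x              # isolates the rightmost 1
        keep = (lowest_one << 1) - 1     # mask of the preserved low bits
        m3 = (x & keep) | (~x & mask & ~keep)
    assert m1 == m2 == m3
    return m1

for pattern in (0b0110, 0b1000, 0b0000):
    print(f"{pattern:04b} -> {twos_complement(pattern, 4):04b}")
```

The patterns 1000 and 0000 map to themselves, which illustrates the note above: in 4 bits, both –8 and 0 are their own 2's complement.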
315
4. Overflow check: If two numbers with the same sign are added,
overflow occurs if and only if the result has an opposite sign.
Example:
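The overflow rule can be sketched as a software model of an N-bit two's complement adder. The function name add_signed and the chosen test values are illustrative assumptions.

```python
def add_signed(a, b, n):
    """Add two n-bit two's complement integers.

    Returns (wrapped_result, overflow_flag), where the flag follows the rule:
    overflow occurs iff both operands have the same sign and the result
    has the opposite sign.
    """
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    assert lo <= a <= hi and lo <= b <= hi
    raw = (a + b) & ((1 << n) - 1)                      # keep only n bits
    result = raw - (1 << n) if raw >> (n - 1) else raw  # reinterpret as signed
    same_sign = (a < 0) == (b < 0)
    overflow = same_sign and ((result < 0) != (a < 0))
    return result, overflow

print(add_signed(100, 50, 8))    # two positives whose sum exceeds +127
print(add_signed(100, -50, 8))   # opposite signs can never overflow
```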
317
Refs:
• Khan, S. A. (2011). Digital design of signal processing systems: a practical
approach. John Wiley & Sons., Section 3.5.7
• Smith, J. O. (2007). Introduction to digital filters: with audio applications (Vol. 2).
Julius Smith., P. 201
Note: Very interesting property; but I haven’t seen a rigorous statement or proof for it, yet.
Please let me know, if you find a good reference.
318
Ref: https://en.wikipedia.org/wiki/Binary-coded_decimal
320
Further Reading: Khan, S. A. (2011). Digital design of signal processing systems: a practical approach. John
Wiley & Sons., Chapter 6
321
[Figure: 32-bit floating-point word, bits b31 … b0, with field boundaries marked at b31, b30–b23, and b22–b0; labels: significand, base. Example bit pattern: 0 01011100 01011010001000011000001]
• The decimal equivalent is: [equation not recovered]
• The exponent is selected such that the left-most bit of the mantissa is always 1 (which isn’t stored in the binary form), making the representation unique.
Example: 0 10000000 10010010000111111011011
which is 0x40490FDB = (0100 0000 0100 1001 0000 1111 1101 1011)_2
A nice tool: http://www.binaryconvert.com
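The word 0x40490FDB can also be decoded programmatically. The sketch below assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits with a bias of 127, 23 fraction bits with an implicit leading 1); the bias and the helper name decode_float32 are assumptions of this sketch, and only normalized numbers are handled.

```python
import struct

def decode_float32(word):
    """Decode a normalized single-precision word, assuming the IEEE 754 layout."""
    sign = word >> 31                 # b31
    exponent = (word >> 23) & 0xFF    # b30..b23, biased by 127 (assumed)
    fraction = word & 0x7FFFFF        # b22..b0, implicit leading 1 not stored
    assert 0 < exponent < 255, "special values and subnormals are not handled"
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + fraction / 2 ** 23)

word = 0x40490FDB
manual = decode_float32(word)
# Cross-check against the platform's own interpretation of the same 32 bits
reference = struct.unpack(">f", struct.pack(">I", word))[0]
print(manual, reference)
```

Both values agree, and they equal the single-precision approximation of π.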
325
Floating-Point Arithmetic
Addition/Subtraction:
1. Make the smaller exponent equal to the larger one (by right-shifting the mantissa of the smaller number)
2. Add/subtract the mantissas (note that the smaller one may vanish to 0 during the right-shifts)
Multiplication/Division:
1. Add/subtract the exponents
2. Multiply/Divide the mantissas
3. Scale and round the results
Special Values:
Floating-point representation has reserved codes for special values including: 0+, 0–,
+∞, -∞, and Not-a-Number (NaN) such as 0/0, +∞/-∞, 0×∞
Note: Due to the (implicit) leading 1 in front of the mantissa, zero needs to be defined as a special value (when all the bits of the exponent and mantissa are zero), which is different from epsilon ($\pm 2^{-127}$)
326
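[Figure: N-bit fixed-point word, bits $b_{N-1}$ … $b_0$, in signed two's complement with a marked radix point; example bit pattern: 0 0 1 0 1 1 1 0 0 0 1 0 1 1 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 1]
• The decimal equivalent is:
$X_{10} = 2^{-M} \times \left(-b_{N-1} 2^{N-1} + \sum_{i=0}^{N-2} b_i 2^i\right)$
where:
• $N$ is the total number of bits
• $M$ is the position of the radix point (the number of fractional bits)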
Note: In fixed-point systems the radix point location is assumed to be fixed throughout
the entire system. That’s where the name comes from.
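A minimal sketch of this conversion is given below: a signed two's complement word with M fractional bits is weighted bit by bit (the MSB carries a negative weight) and then scaled by 2^(-M). The function name fixed_to_float and the MSB-first bit-string input format are choices made for this illustration.

```python
def fixed_to_float(bits, m):
    """Decimal equivalent of a fixed-point word.

    bits: string of N bits, MSB first, signed two's complement
    m:    number of fractional bits (position of the radix point)
    """
    n = len(bits)
    value = -int(bits[0]) * 2 ** (n - 1)          # sign bit has negative weight
    value += sum(int(b) * 2 ** i                  # remaining bits, LSB has i = 0
                 for i, b in enumerate(reversed(bits[1:])))
    return value / 2 ** m

print(fixed_to_float("01101", 2))   # 011.01 in binary
print(fixed_to_float("11101", 2))   # same magnitude bits with the sign bit set
```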
327
floating-point numbers
328
Fixed-Point Arithmetic
Addition/Subtraction:
1. Align the radix points
2. Zero pad the LSB of numbers with shorter fractional lengths
3. Sign extend the MSB of numbers with shorter integer lengths
4. Apply addition/subtraction
Multiplication/Division:
1. Apply multiplication/division as if they were integer valued (regardless of
the radix point)
2. Find the appropriate radix point by adding/subtracting the radix points
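These rules can be sketched with numbers stored as a pair (raw integer, number of fractional bits). The helper names fx_add, fx_mul, and fx_to_float are hypothetical; Python integers are unbounded, so sign extension of the MSBs happens implicitly and output word-length limits are not modeled.

```python
def fx_add(a, fa, b, fb):
    """Add raw values a (fa fractional bits) and b (fb fractional bits)."""
    f = max(fa, fb)
    # Align the radix points: zero-pad the LSBs of the shorter fraction
    return (a << (f - fa)) + (b << (f - fb)), f

def fx_mul(a, fa, b, fb):
    """Multiply as plain integers; the fractional lengths add up."""
    return a * b, fa + fb

def fx_to_float(raw, f):
    return raw / (1 << f)

x = (13, 2)   # 3.25 with 2 fractional bits
y = (3, 1)    # 1.5  with 1 fractional bit
print(fx_to_float(*fx_add(*x, *y)))
print(fx_to_float(*fx_mul(*x, *y)))
```

In hardware the result registers must be wide enough for the grown integer part; that sizing is left out of this sketch.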
1 1 1 1 1 0 1 0 0 0 1 0 1 1 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0
bN-1 b0
332
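[Figure: probability distributions of the truncation error e, over the integers from $-2^{p-1}$ to $2^{p-1}-1$ (slightly biased towards negative numbers), and of x, over the integers from $-2^{m-1}$ to $2^{m-1}-1$. The total bits ($x_n$) are split at the truncation point into preserved bits ($y_n$) and discarded bits ($e_n$); example word: 1 1 0 0 0 1 0 0 0 0 1 1 0 0 1 1 0 0]
Error mean:
$\bar{e} = E\{e_n\} = \sum_{i=-2^{p-1}}^{2^{p-1}-1} i \cdot \frac{1}{2^p} = -\frac{1}{2}$
Error variance:
$\sigma_e^2 = E\{(e_n - \bar{e})^2\} = \sum_{i=-2^{p-1}}^{2^{p-1}-1} \left(i + \frac{1}{2}\right)^2 \cdot \frac{1}{2^p} = \frac{2^{2p} - 1}{12}$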
334
$\mathrm{SNR}_{dB} \approx 10\log_{10}\frac{2^{2m}}{2^{2p}} = 20(m - p)\log_{10} 2 \approx 6.02(m - p)$
Note: This is the 6 dB-per-bit rule of thumb: truncating each bit reduces the SNR by about 6 dB. We will find a similar rule later for ADC performance with different signal and noise distributions.
Exercise: Derive the above equations (mean and variance of error) analytically. Do the results change if the number is in the Qm.n format?
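As a numerical companion to the exercise, the mean and variance of the truncation error can be obtained by direct enumeration. The sketch assumes that the p discarded bits take every integer value from -2^(p-1) to 2^(p-1)-1 with equal probability; the function names are hypothetical.

```python
from fractions import Fraction
import math

def truncation_error_stats(p):
    """Exact mean and variance of a uniformly distributed p-bit truncation error."""
    values = range(-2 ** (p - 1), 2 ** (p - 1))
    prob = Fraction(1, 2 ** p)                     # each value is equally likely
    mean = sum(i * prob for i in values)
    variance = sum((i - mean) ** 2 * prob for i in values)
    return mean, variance

def truncation_snr_db(m, p):
    """Approximate SNR when p of m bits are truncated: 20 (m - p) log10(2)."""
    return 20 * (m - p) * math.log10(2)

mean, variance = truncation_error_stats(4)
print(mean, variance, Fraction(2 ** 8 - 1, 12))
print(round(truncation_snr_db(16, 8), 2), "dB")
```

Exact fractions are used so that the enumerated variance can be compared with the closed form without rounding error.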
335
2-p 2-p
… …
•• 0 2p-1
• e
• •-2 p-1 0 2p-1-1
• e
3.25
3.25 0 1 1 0 1
337
integer fraction
x: 0 0 1 0 1 1 1 0 0 0 1 0 1 1 0 1
xN-1 xN-P-1 x0 X
y: 0 1 0 0 1 1 0 0 0 1 1 1 1 0 1 0 1 0 0 0 1
yN-1 y0
3.3125 0 1 1 0 1 0 1 3.3
Bit growth: $G = \log_2 \sum_m |h_m|$
[Figure: input $x_n$ (B bits) → filter $h_n$ (impulse response) → output $y_n$ (B + G bits)]
Note: From Signals & Systems Theory we know that for a stable causal filter $\sum_m |h_m| = B < \infty$. Therefore, “the output of a stable filter with a bounded input can always be stored in a register of finite length without overflow”.
340
2. Random input signals: Using Parseval’s theorem, the output variance of a filter with a random input is related to its input variance as follows:
$\sigma_y^2 = \sigma_x^2 \sum_m |h_m|^2$
Therefore, with the following bit growth, the probability of overflow at a filter’s output is (almost) equal to the probability of input overflow:
Bit growth: $G_1 = \log_2 \sqrt{\sum_m |h_m|^2}$
343
Result: In this example, the L1-norm and the narrow-band assumption both demand 4 additional bits at the output $y_n$; but according to the output variance criterion, if occasional overflows are acceptable, adding only 2 bits is statistically sufficient.
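The two bit-growth criteria can be compared numerically. The sketch below uses a hypothetical 16-tap moving-sum filter (all coefficients equal to 1), which is not necessarily the filter of the slide's example, and rounds each bit growth up to a whole number of bits; the rounding and the function names are assumptions of this sketch.

```python
import math

def bit_growth_worst_case(h):
    """Bits needed so that a bounded input can never overflow (L1-norm)."""
    return math.ceil(math.log2(sum(abs(c) for c in h)))

def bit_growth_statistical(h):
    """Bits needed to keep the output overflow probability close to the input's."""
    return math.ceil(math.log2(math.sqrt(sum(c * c for c in h))))

h = [1] * 16   # hypothetical impulse response: 16-tap moving sum
print(bit_growth_worst_case(h), bit_growth_statistical(h))
```

For this filter the worst-case criterion asks for twice as many extra bits as the variance criterion, which is the kind of trade-off described in the result above.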
ANALOG TO DIGITAL
CONVERTERS AND DIGITAL
TO ANALOG CONVERTERS
345
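[Figure: signal chain. Analog signal x(t) → anti-aliasing filter → sample and hold (time-domain discretization, sampling @fs) → quantization and encoding (amplitude discretization) → digital signal x[n] → FPGA → y[n] → DAC → anti-imaging filter → y(t). The sample-and-hold and quantization/encoding blocks together form the ADC.]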
346
Ref: https://en.wikipedia.org/wiki/Nyquist-Shannon_sampling_theorem
Further Reading: Alan V. Oppenheim, Alan S. Willsky, and S. Hamid
Nawab. Signals & Systems (2nd Ed.). Prentice-Hall, Inc., 1996
reconstructed signal
347
$\sigma_e^2 = E\{(e - \bar{e})^2\} = \int_{-\infty}^{+\infty} (e - \bar{e})^2 f_e(e)\,de = \frac{\Delta^2}{12}$
[Figure: quantization error probability density function $f_e(e)$, uniform with height $1/\Delta$ over $-\Delta/2 \le e \le +\Delta/2$]
We have calculated the denominator of the SNR equation. In the sequel we consider three cases for the input signal: sinusoidal (deterministic) signal, Gaussian distributed stochastic signal, and uniformly distributed stochastic signal.
350
or
SNR dB ≈ 6.02B + 1.76dB
Note: This is the well-known 6dB per-bit rule, which should be memorized as a
rule of thumb by any hardware engineer!
351
or
SNR dB ≈ 6.02B
Note: The 1.76dB is no longer there, but we still see the 6dB per-bit property.
352
Non-ideal ADC
• Practical ADC circuits are never ideal and do not reach their nominal performance (SNR = 6.02B + 1.76 dB).
• The standard approach to measuring the true performance of an ADC is to apply a sinusoidal input signal with an amplitude of 1 dB below full-scale (to avoid overflow) and to measure the real SNR and the effective number of bits (ENOB):
$\mathrm{ENOB} = \frac{\mathrm{SNR}_{dB} - 1.76\,\mathrm{dB}}{6.02}$
where:
• $\mathrm{SNR}_{dB}$ is the true SNR, measured by giving a full dynamic-range sinusoid to the ADC and measuring the SNR of an acquired block of data
• ENOB is the effective number of bits; a real value, always smaller than the nominal number of ADC bits (ENOB < B)
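The ENOB relation is simple to evaluate in code. In the sketch below, the measured SNR of 71.7 dB is a hypothetical value for a nominally 14-bit converter, chosen only for illustration and not taken from any datasheet.

```python
def ideal_snr_db(bits):
    """Nominal SNR of an ideal B-bit ADC with a full-scale sinusoidal input."""
    return 6.02 * bits + 1.76

def enob(measured_snr_db):
    """Effective number of bits from a measured SNR in dB."""
    return (measured_snr_db - 1.76) / 6.02

nominal_bits = 14
measured_snr_db = 71.7   # hypothetical measurement
print(f"ideal SNR: {ideal_snr_db(nominal_bits):.2f} dB")
print(f"ENOB:      {enob(measured_snr_db):.2f} bits")
```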
354
ENOB Examples
• AD9246 14-Bit, 80 MSPS/105 MSPS/125 MSPS, 1.8 V Analog-to-
Digital Converter:
355
2. Making the input sequence distribution uniform: A useful theorem from random variables:
If a random variable (RV) x with probability density function (pdf) $f_X(x)$ and cumulative distribution function (CDF) $F_X(x)$ passes through a nonlinear memoryless system with the characteristic $u = F_X(x)$, the output u is uniformly distributed. Conversely, if a uniformly distributed RV u is applied to $y = F_X^{-1}(u)$, the output has the distribution $f_X(\cdot)$.
Note: This property can be used to generate arbitrary RVs from uniform distributions, and vice versa, in FPGA.
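As a software illustration of this theorem, the sketch below turns uniform samples into exponentially distributed ones. The exponential distribution, its rate, and the function name are assumptions chosen because the inverse CDF has a closed form, $F_X^{-1}(u) = -\ln(1-u)/\lambda$; in an FPGA the inverse CDF would more likely be stored in a lookup table.

```python
import math
import random

def exponential_from_uniform(u, rate):
    """Inverse CDF of the exponential distribution F(x) = 1 - exp(-rate * x)."""
    return -math.log(1.0 - u) / rate

rng = random.Random(0)          # fixed seed for a repeatable run
rate = 2.0
samples = [exponential_from_uniform(rng.random(), rate) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
print(f"sample mean = {sample_mean:.3f}, theoretical mean = {1 / rate:.3f}")
```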
356
Xs(f) Xs(f)
E(f) E’(f)
-fs-B -fs -fs+B -B +B fs-B fs fs+B f -fs-B -fs -fs+B -B +B fs-B fs fs+B f
357
Note: Dithering improves the SFDR at a cost of decreasing the SNR (increasing the noise floor)
Note: Dithers can be generated in FPGA using linear-feedback shift registers (LFSR)
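A software model of an LFSR gives an idea of how such a dither source behaves. The sketch below uses a 16-bit Fibonacci LFSR with the feedback polynomial x^16 + x^14 + x^13 + x^11 + 1; this polynomial and the seed 0xACE1 are a common textbook choice and an assumption here, not necessarily what the course uses.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one clock."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def lfsr16_period(seed=0xACE1):
    """Count the clocks needed to return to the (non-zero) seed state."""
    state, count = lfsr16_step(seed), 1
    while state != seed:
        state = lfsr16_step(state)
        count += 1
    return count

# A short pseudo-random bit stream that could drive a 1-LSB dither
state, dither_bits = 0xACE1, []
for _ in range(16):
    state = lfsr16_step(state)
    dither_bits.append(state & 1)
print(dither_bits, lfsr16_period())
```

The register visits every non-zero 16-bit state before repeating, so the sequence is maximal-length; the all-zero state must be avoided because the register would never leave it.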
360
Background
• Real-world applications require the representation of real-valued
data in floating-point or fixed-point formats
• Real numbers can be approximated in these formats using the
necessary number of bits and by proper scaling
Real System:
FPGA
x0(t) x´(t) x´s[n] y´s[n]
Analog
+ Front-End
+ ADC + Processing +
input noise front-end noise ADC quantization noise round-off error noise
365
Digital
Processing
I
× hi[n]
cn=cos(ωn)
|x(t)|<1 16-bit xs[n]
DDS Processing
ADC
sn=sin(ωn)
Note 1: The internal register lengths are selected according to the input noise level and ENOB, not the ADC number of bits
Note 2: The SNR can be increased due to the processing gain. For example, remember the SNR improvement due to over-sampling noted in the previous section
ARBITRARY WAVEFORM
GENERATION
368
Waveform Generation
The calculation/generation of arbitrary functions/waveforms of the
form y = f(x) is required in many computational and signal processing
applications. We study several methods for this purpose:
• Arbitrary functions:
• Direct Implementations (functional calculation)
• Lookup-Tables & Interpolated Lookup-Tables
• Special functions:
• CORDIC machines
• Periodic functions:
• NCO and Periodic Waveform Generators
• Recursive Oscillators
• Random signal:
• LFSR
369
d2N-1
371
xMSB xLSB
x: 0 0 1 0 1 1 1 0 0 0 1 0 1 1 0 1
xN-1 xN-P-1 x0
d0
x xMSB (P bits) x1 y1=f(x1)
Note: Similar ideas can be implemented using quadratic and spline interpolations. See the following reference for further ideas and general LUT-based methods: Parhami, B. (2000). Computer arithmetic: Algorithms and hardware designs. Oxford University Press, Chapter 24
374
[Figure: NCO block diagram. Labels: clock (Fs), increment, N-bit accumulator, LUT address, sine wave output, mixer value]
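The phase-accumulator principle of an NCO can be modeled in a few lines. In this sketch an N-bit accumulator is advanced by a fixed increment on every clock and its top bits address a sine lookup table, so that the output frequency is increment × Fs / 2^N. All parameter values and the function name nco are assumptions made for this illustration.

```python
import math

def nco(increment, n_bits, lut_bits, n_samples):
    """Generate n_samples of a sine wave from a phase accumulator and a LUT."""
    lut = [math.sin(2 * math.pi * k / 2 ** lut_bits) for k in range(2 ** lut_bits)]
    phase, out = 0, []
    for _ in range(n_samples):
        out.append(lut[phase >> (n_bits - lut_bits)])     # MSBs address the LUT
        phase = (phase + increment) & (2 ** n_bits - 1)   # wrap-around = one period
    return out

fs = 100e6                                   # hypothetical clock rate
n_bits, lut_bits, increment = 16, 8, 2 ** 12
samples = nco(increment, n_bits, lut_bits, 32)
print(f"f_out = {increment * fs / 2 ** n_bits / 1e6} MHz")
```

With these values one output period spans 2^16 / 2^12 = 16 clocks; a finer frequency resolution is obtained by widening the accumulator rather than the LUT.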
CORDIC Machines
• The direct implementation of arbitrary functions requires
considerable logic resources and LUT-based methods
require considerable memory.
• Classes of mathematical functions can be generated with a
combination of small-size LUTs and set of shifts and
adds/subtracts.
• The Coordinate Rotation Digital Computer (CORDIC) is
one such method
• The CORDIC machine was invented in 1956 by Jack E. Volder for use in the B-58 bomber's navigation system, for accurate real-time digital calculations
378
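$x_{n+1} = x_n - d_n y_n 2^{-n}$
$y_{n+1} = y_n + d_n x_n 2^{-n}$
$z_{n+1} = z_n - d_n \arctan 2^{-n}$
where
• $\arctan 2^{-n}$ are pre-calculated and stored in a LUT
• $d_n = \mathrm{sign}(z_n)$ (+1 if $z_n \ge 0$ and −1 if $z_n < 0$)
If $|z_0| < \theta_{max} = \sum_{n=0}^{\infty} \arctan 2^{-n} = 1.7432866\ldots$, it can be shown that:
$\lim_{n\to\infty} \begin{pmatrix} x_n \\ y_n \\ z_n \end{pmatrix} = K \times \begin{pmatrix} x_0 \cos z_0 - y_0 \sin z_0 \\ x_0 \sin z_0 + y_0 \cos z_0 \\ 0 \end{pmatrix}$
where $K = \prod_{n=0}^{\infty} \sqrt{1 + 2^{-2n}} = 1.6467603\ldots$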
$d_n = \begin{cases} 1 & \text{if } t_n \ge 0 \\ -1 & \text{otherwise} \end{cases}$
380
• The term $\cos w_n = 1/\sqrt{1 + 2^{-2n}}$ is the only required multiplication, which can be omitted, as it does not alter the rotation angles and only changes the vector magnitudes.
• Alternatively, depending on the number of iterations $P$, $A = 1/\prod_{n=0}^{P} \sqrt{1 + 2^{-2n}}$ can be compensated as a constant multiplier.
381
Reference and further reading: Muller, Jean-Michel. Elementary functions. Birkhäuser Boston, 2006. Chapter 7
382
Common Clock
New Data
x0 State
• Resource Shared: y0 Controller
z0 Data Ready
Stage
PARAMS
xN-1 xN
yN-1 Single yN
zN-1 Stage zN
CORDIC
384
Ref: https://en.wikipedia.org/wiki/Linear-feedback_shift_register
387
Ref: Stavinov, Evgeni. 100 power tips for FPGA designers. Evgeni Stavinov, 2011
390
[Figure: a random or pseudo-random uniformly distributed variable u ~ U(0,1) is passed through $F_X^{-1}(u)$ to give y ~ $f_X(x)$]
391
Background
• The notion of pipelining was introduced before, as a means of improving
the design timing, to achieve the design constraints (clock speed)
• Different techniques for pipelining and timing improvement in FPGA
systems are presented in this section, including:
• Retiming
• Re-pipelining
• Cut-set retiming
• C-slow retiming
• Pipelining in feedback systems
References:
• Hauck, Scott, and Andre DeHon. Reconfigurable computing: the theory and
practice of FPGA-based computation. Vol. 1. Elsevier, 2010, Chapter 18
• Khan, Shoab Ahmed. Digital design of signal processing systems: a
practical approach. John Wiley & Sons, 2011, Chapter 7
394
Retiming
• Retiming consists of reducing the critical path (increasing the clock
speed) by moving the pipeline registers to an “optimal position”.
Example: In the following, each circle denotes combinational logic, with the number representing the combinational latency
Retiming (continued)
• For systematic retiming, a digital circuit is converted to a data flow graph
(DFG). Next, by using graph theory based theorems, the registers are
systematically moved across the computational nodes (combinational
logic), without changing the input/output transfer function of the original
DFG.
Delay Transfer Theorem: “without affecting the transfer function of the
system, registers can be transferred from each incoming edge of a node
of a DFG, to all outgoing edges of the same node or vice versa” [Khan,
2011, p. 304].
396
Retiming (continued)
• Retiming can also be used to merge excess registers to reduce the area
utilization.
Example:
397
Peripheral Retiming
• In this technique: 1) all the internal registers are shifted to the input or output of
the design; 2) the combinational logic is simplified; finally 3) the registers are
pushed to their optimal position by conventional retiming.
Example:
(1) (2)
(3)
399
Re-pipelining
• In feed-forward designs, re-pipelining adds additional registers at
the input or output and then moves these registers across the
design (by retiming) to obtain the best performance.
• The cost of re-pipelining is the additional number of registers
added to the pipeline which adds a constant clock latency
between the input and output; but the other properties of the
design are preserved.
Cut-set Retiming
• More generally, cut-set retiming permits the addition of an arbitrary number of registers in a forward path, or moving registers from the input to the output (or vice versa) of a cut-set, while preserving the I/O transfer function.
• Reminder: In graph theory, a cut is a partition of the vertices of a graph into two disjoint subsets; the cut-set is the set of edges that have one endpoint in each subset.
C-Slow Retiming
• C-slow retiming consists of replicating all the registers of a synchronous design C
times, followed by moving the registers (conventional retiming), or by splitting the
circuit into C distinct parallel paths which multiplex and switch between the input data
and results.
Note: The design interleaves between two computations (2-slow): on the first clock
cycle, it accepts the first input for the first data stream; on the second clock cycle, it
accepts the first input for the second stream, and on the third it accepts the second
input for the first stream. Due to the interleaved nature of the design, the two streams
of execution will never interfere (on odd clock cycles, the first stream of execution
accepts input; on even clock cycles, the second stream accepts input).
408
original circuit
xr[0] xi[0] xr[1] xi[1] xr[2] xi[2] xr[3] … yr[0] yi[0] yr[1] yi[1] yr[2] yi[2] yr[3] …
411
Original circuit
Original circuit
a switch
416
Solution?
417
Further Reading
• Further reading on pipelining, folding and unfolding techniques for feed-forward
and feedback systems:
1. Khan, Shoab Ahmed. Digital design of signal processing systems: a practical
approach. John Wiley & Sons, 2011, Chapter 7.
2. Meyer-Baese, Uwe. Digital signal processing with field programmable gate arrays. Vol. 2. Berlin: Springer, 2004, Chapter 4.
3. Hauck, Scott, and Andre DeHon. Reconfigurable computing: the theory and
practice of FPGA-based computation. Vol. 1. Elsevier, 2010, Chapter 18
METASTABILITY & MULTIPLE CLOCK DOMAINS
Introduction
• Up to now, we have considered flip-flops and other logic devices as fully
deterministic elements.
• However, in reality, no two flip-flops are “exactly” the same. Minor deviations
in the electrical characteristics of these elements, caused by fabrication variations,
result in stochastic behavior.
• Although current FPGA vendors guarantee extremely robust behavior and
extremely low probabilities of device failure, the stochastic aspects must be
considered in certain cases, including multiple-clock-domain applications,
where metastability may occur.
• In this section, we study some of the stochastic aspects of digital elements such
as flip-flops, and robust design methods that reduce the probability of metastability
and failure of digital systems.
Metastability
Metastability can occur when:
1. A flip-flop’s timing slack is violated (clock rate too high)
2. The data input to a flip-flop is asynchronous to the clock
(leading to setup-time or hold-time violations)
3. Multiple unsynchronized clock domains are used
Metastability Examples
Ref: Stavinov, Evgeni. 100 power tips for FPGA designers. Evgeni Stavinov, 2011
• Reference: Ginosar, Ran. "Metastability and synchronizers: A tutorial." IEEE Design &
Test of Computers 28.5 (2011): 23-35.
MTBF = exp(t_MET / τ) / (W · fC · fD)
where:
• fC: system clock rate (flip-flop clock)
• fD: (asynchronous) input data rate
• W: metastability window length constant
• τ: metastability time constant
• t_MET: time allowed for the metastability to resolve itself
Note: W and τ are constants depending on the setup-time and hold-time of the device
(vendor and technology dependent)
MTBF Calculation
Example 1: Consider a 28nm high-performance CMOS ASIC process with
W=20ps and τ=10ps (typical values for this process technology).
Assuming fC=1GHz and fD=100MHz, and allowing one clock period
(t_MET=1ns) for resolution, we find MTBF≈4×10^29 years
for a single-stage synchronizer (the universe is estimated to be about 10^10
years old).
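The figure in Example 1 can be reproduced numerically. The sketch below assumes the commonly used model MTBF = exp(t_MET/τ)/(W·fC·fD), as in the Ginosar tutorial cited above, and assumes that t_MET equals one clock period (1/fC = 1 ns); the function name and the seconds-per-year constant are illustrative choices.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def mtbf_seconds(t_met, tau, w, f_c, f_d):
    """MTBF in seconds, assuming MTBF = exp(t_met / tau) / (w * f_c * f_d)."""
    return math.exp(t_met / tau) / (w * f_c * f_d)


# Example 1 parameters; t_met = one clock period is an assumption.
f_c = 1e9        # flip-flop clock rate, Hz
f_d = 100e6      # asynchronous data rate, Hz
w = 20e-12       # metastability window, s
tau = 10e-12     # metastability time constant, s
t_met = 1 / f_c  # resolution time, s

mtbf_years = mtbf_seconds(t_met, tau, w, f_c, f_d) / SECONDS_PER_YEAR
print(f"MTBF = {mtbf_years:.1e} years")
```

The exponential term dominates: each additional τ of resolution time multiplies the MTBF by e, which is why a modest increase in t_MET changes the result by many orders of magnitude.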
How many synchronizer stages are required? The parameters W and τ are
commonly provided by IC manufacturers; fC and fD are known by design. The
designer can define a desired MTBF, calculate the corresponding t_MET, and choose
the number of stages needed to meet that MTBF.
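This procedure can be sketched as follows, again assuming the model MTBF = exp(t_MET/τ)/(W·fC·fD), which gives t_MET = τ·ln(MTBF·W·fC·fD). The stage count uses a simplifying assumption that each synchronizer stage contributes one full clock period of resolution time (setup time and routing delay are ignored); the function names and the target MTBF are hypothetical.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def required_t_met(mtbf_target_s, tau, w, f_c, f_d):
    """Resolution time needed for a target MTBF (in seconds),
    from t_met = tau * ln(MTBF * w * f_c * f_d)."""
    return tau * math.log(mtbf_target_s * w * f_c * f_d)


def stages_needed(t_met, f_c):
    """Stage count, assuming each stage provides one full clock
    period of resolution time (a simplification)."""
    return max(1, math.ceil(t_met * f_c))


# Parameters of Example 1 with a hypothetical target MTBF of 1e10 years.
tau, w, f_c, f_d = 10e-12, 20e-12, 1e9, 100e6
target_s = 1e10 * SECONDS_PER_YEAR

t_met = required_t_met(target_s, tau, w, f_c, f_d)
print(f"required t_MET = {t_met * 1e12:.0f} ps")
print(f"stages needed  = {stages_needed(t_met, f_c)}")
```

In practice the usable resolution time per stage is shorter than a clock period, so the vendor's timing data should be used in place of the one-period assumption.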
Metastability Guidelines
Avoiding metastability (by design):
1. Avoiding real-time data transfer between different clock domains
2. Using a single global clock instead of multiple clock domains
3. Avoiding gated clocks and using standard clock-rate reduction techniques
(clock enables) instead
Solving metastability (by implementation):
1. Clock synchronization using DCMs (Digital Clock Managers)
2. Using synchronizers (register chains and asynchronous FIFOs) to reduce the
probability of metastability
Note: These methods only reduce the probability of metastability; they do not solve
other rate-mismatch issues that arise when transferring data between clock domains.
For example, sampling data that changes at fD=80MHz with a clock of
fC=100MHz regularly repeats samples, whereas sampling it at fC=60MHz
regularly loses data (even without metastability).
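The repeated and lost samples in this example can be counted with an idealized model in which metastability is ignored and the k-th clock edge simply captures whichever data item is current at time k/fC. The function name and the sample counts below are illustrative assumptions.

```python
def sample_indices(f_d, f_c, n_samples):
    """Index of the data item captured at each clock edge.
    Data item i is valid during [i/f_d, (i+1)/f_d); edge k occurs at k/f_c.
    Integer arithmetic avoids floating-point rounding at the boundaries."""
    return [(k * f_d) // f_c for k in range(n_samples)]


def count_repeats_and_losses(indices):
    """Count samples that repeat the previous item and items never sampled."""
    repeats = sum(1 for a, b in zip(indices, indices[1:]) if b == a)
    losses = sum(max(0, b - a - 1) for a, b in zip(indices, indices[1:]))
    return repeats, losses


fast = sample_indices(80, 100, 11)   # fD = 80 MHz sampled at fC = 100 MHz
slow = sample_indices(80, 60, 7)     # fD = 80 MHz sampled at fC = 60 MHz
print(fast, count_repeats_and_losses(fast))
print(slow, count_repeats_and_losses(slow))
```

Sampling faster than the data rate duplicates items and sampling slower skips them, which is why a FIFO with flow control or a handshake is needed in addition to a synchronizer.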
FIFO
Ref: Stavinov, Evgeni. 100 power tips for FPGA designers. Evgeni Stavinov, 2011
[Figure labels: synchronization register chain; unpredictable routing delays]
Ref: https://china.xilinx.com/support/documentation/application_notes/xapp094.pdf
Note: Xilinx does not appear to publish flip-flop MTBF figures for its newer devices, but Vivado® reports them
during implementation.
Ref: ftp://ftp.altera.com/pub/lit_req/document/an/an042.pdf
Introduction**
• Complex FPGA-based systems can contain multiple units
(modules), each having multiple operation modes that are
selected by appropriate control pins (or a control bus) and each
issuing output messages on different occasions (handshakes, error
codes, overflow flags, etc.)
• Each element of a design should have a unique address in the
system’s memory map, which can be accessed via proper
commands
• In mixed CPU-FPGA systems, the internal memory map of the
FPGA is commonly accessible by the software units
• The design of a memory map is discussed in this section by
examples
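As a software-side illustration of these points, the sketch below models a memory map in which every element has a unique address that is read and written through simple commands. The register names, addresses and 32-bit width are hypothetical and not taken from any particular design.

```python
class MemoryMap:
    """Toy model of an FPGA memory map: each named element has a unique address."""

    def __init__(self, layout, width_bits=32):
        addresses = [addr for addr, _ in layout]
        if len(set(addresses)) != len(addresses):
            raise ValueError("every element needs a unique address")
        self._names = dict(layout)                  # address -> element name
        self._values = {addr: 0 for addr in addresses}
        self._mask = (1 << width_bits) - 1

    def write(self, addr, value):
        if addr not in self._values:
            raise KeyError(f"no element mapped at address {addr:#x}")
        self._values[addr] = value & self._mask     # truncate to register width

    def read(self, addr):
        if addr not in self._values:
            raise KeyError(f"no element mapped at address {addr:#x}")
        return self._values[addr]

    def name_of(self, addr):
        return self._names[addr]


# Hypothetical layout: (address, element name)
layout = [(0x00, "MODULE1_MODE"), (0x04, "MODULE1_GAIN"), (0x08, "MODULE1_STATUS")]
mm = MemoryMap(layout)
mm.write(0x04, 3)                      # e.g. a "write" command from the CPU side
print(mm.name_of(0x04), mm.read(0x04))
```

In a real mixed CPU-FPGA system the same address table would typically be shared between the HDL and the software driver so that both sides agree on the layout.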
[Figure: FMC110 and ML605 boards]
Memory Map
[Figure (block diagram) labels: PC, Ethernet, FPGA; FPGA command dispatcher; common bus (commands/messages/variable parameters); bus handler; Module 1 with Submodule 1 and Submodule 2; local command dispatchers; local command bus (commands/messages/variable parameters); data path]
Introduction
• As with other aspects of FPGA designs, data transfer
inside an FPGA and between FPGA systems can be fully
customized.
• In this section we review the most common techniques
used for data transfer in FPGA designs
• The two classes of data transfer methods that we study
are:
• Stream Transfer
• Packet Transfer
[Figure labels: continuous stream on either side of a Block Data Processor; Input Buffer 1 and Input Buffer 2; Output Buffer 1 and Output Buffer 2; buffers marked "in write mode" and "in read mode"; four switches; …]
Further Examples
• CORDIC core generators
• LFSR generators
• Fast Fourier Transform (FFT) architecture generator
FPGA DESIGN DOCUMENTATION
Hardware Documentation**
• Design documentation is an essential part
of any engineering project
• Both specific and general documentation tools and
techniques can be used for hardware documentation
• Some of these techniques and tools are reviewed in this
section by example: Doxygen, LaTeX, etc.
References:
1. Stavinov, E. (2011). 100 power tips for FPGA designers. Evgeni
Stavinov.
2. Xilinx Power Solutions http://xilinx.com/power
3. Seven Steps to an Accurate Power Estimation using XPE, Xilinx White
Paper WP353
http://www.xilinx.com/support/documentation/white_papers/wp353.pdf
4. XPower User Guide, Xilinx User Guide UG440
http://www.xilinx.com/support/documentation/user_guides/ug440.pdf