SW Lab 3 Fixed Point Simulation EE 462
Lab Goals
• Understand fixed-point representation of numbers and fixed-point computations.
• Compare fixed-point and floating-point implementations of a simple function through simulation in Matlab.
• Understand subtle issues related to quantization of the coefficients in an FIR filter.
• Handle overflow issues.
1 Introduction
The processor you will use in the hardware labs, the TMS320C5515, is a fixed-point processor. Fixed-point
arithmetic is generally used when hardware resources are limited and a reduction in accuracy can be tolerated
in return for higher execution speed. Fixed-point processors are typically 16-bit or 24-bit devices, while floating-point
processors are usually 32-bit devices. A typical 16-bit processor, such as the TMS320C55x, stores data as a 16-bit
integer or as a fraction in a fixed range. Although signals are stored with only 16 bits, intermediate values
during arithmetic operations may be kept at 32 bits or more. This is done using the internal 40-bit accumulator,
which helps reduce cumulative round-off errors. Fixed-point DSP devices are usually cheaper and faster than
their floating-point counterparts because they use less silicon, have lower power consumption, and require fewer
external pins. Most high volume low cost embedded applications, such as appliance control, cellular phones,
hard disk drives, modems, audio players, and digital cameras use fixed-point processors.
Floating-point arithmetic greatly expands the dynamic range of numbers. A typical 32-bit floating-point DSP
processor, such as the TMS320C67x, represents numbers with a 24-bit mantissa and an 8-bit exponent. The
mantissa represents a fraction in the range -1.0 to +1.0, while the exponent is an integer that represents the
number of places that the binary point must be shifted left or right in order to obtain the true value. For
example, in the decimal number system the number 10.2345 can be written in the form
0.102345 × 10². In this format, 0.102345 is called the mantissa and 2 is called the exponent.
A 32-bit floating-point format covers a large dynamic range. Thus the data dynamic range restriction may be
virtually ignored in a design using floating-point DSP processors. But in fixed-point format, the designer has
to apply scaling factors and other techniques to prevent arithmetic overflow, which is usually a difficult and
time-consuming process. As a result, floating-point DSP processors are generally easier to program and use,
but are more expensive and have higher power consumption. In this session, you will learn about the issues related
to fixed-point arithmetic.
2 Fixed-point Formats
Two’s complement Integer Representation
This is the most common method for representing a signed integer. The two's complement of a binary
number is defined as the value obtained by subtracting the number from a large power of two (specifically,
from 2^{N} for an N-bit word); i.e., a negative number x is represented as 2^{N} + x. With N = 4,
x = (−6)₁₀ is represented as 16 + (−6) = (10)₁₀ = 1010₂. As far as the hardware is concerned, fixed-point
number systems represent data as B-bit integers. The two's-complement number system usually used is:
k = { binary integer representation of k,      if 0 ≤ k ≤ 2^{B−1} − 1
    { (bitwise complement of |k|) + 1,         if −2^{B−1} ≤ k < 0        (1)
The most significant bit is known as the sign bit. It is 0 when the number is non-negative and 1 when the
number is negative.
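This encoding rule can be sketched in a few lines (an illustrative Python snippet; the lab itself uses Matlab):

```python
def to_twos_complement(k, B=4):
    """Encode integer k as a B-bit two's-complement bit string."""
    if not -(2 ** (B - 1)) <= k <= 2 ** (B - 1) - 1:
        raise ValueError(f"{k} is out of range for {B} bits")
    if k < 0:
        k += 2 ** B          # a negative k is represented as 2^B + k
    return format(k, f"0{B}b")

print(to_twos_complement(-6))   # '1010', matching the N = 4 example above
print(to_twos_complement(5))    # '0101'
```

Note that the most significant bit of the result is 1 exactly when the input is negative, as described above.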
Figure 1 is an easy way to visualize two’s complement representation.
The word-length is B (= M + 1) bits, i.e., M magnitude bits and one sign bit. The most significant bit (MSB)
is the sign bit b_0, which represents the sign of the number as follows:

b_0 = { 0, if x ≥ 0
      { 1, otherwise        (2)

The remaining M bits give the magnitude of the number. The rightmost bit b_M is called the least significant
bit (LSB), which determines the precision of the number.
The decimal value corresponding to a fractional two's-complement binary number x can be expressed as

x = −b_0 + Σ_{m=1}^{M} b_m 2^{−m}        (4)
Figure 3 provides a visual representation of a 3-bit fractional representation.
For example, from Figure 2, the easiest way to convert a normalized 16-bit fractional binary number into an
integer that can be used by the C55x assembler is to move the binary point to the right by 15 bits (to the right of
b_M). Since shifting the binary point 1 bit to the right is equivalent to multiplying the fractional number by 2, moving
the binary point to the right by 15 bits can be done by multiplying the decimal value by 2¹⁵ = 32768.
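This conversion can be written out as a short sketch (in Python for illustration; the lab itself uses Matlab):

```python
def q15_from_float(x):
    """Convert a fraction in [-1, 1) to its 16-bit Q15 integer (round to nearest)."""
    if not -1.0 <= x < 1.0:
        raise ValueError("Q15 can only represent values in [-1, 1)")
    # shift the binary point right by 15 bits, i.e. multiply by 2^15,
    # clamping at the largest representable Q15 value
    return min(int(round(x * 32768)), 32767)

def q15_to_float(k):
    """Recover the fractional value represented by a Q15 integer."""
    return k / 32768.0

print(q15_from_float(0.5))    # 16384
print(q15_from_float(-1.0))   # -32768
print(q15_to_float(16384))    # 0.5
```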
Q-format
An alternate format for storing rational numbers is Qn.m format as illustrated in Figure 4. There are n bits
to the left of the binary point representing integer portion, and m bits to the right, representing a fractional
value. We have N=n+m+1, where N = word length in bits (ex. 16).
The choice of n and m controls the trade-off between the range and the resolution of the representation. The most
popular fractional number format is Q0.15 (n = 0 and m = 15), which is simply referred to as Q15
format since there are 15 fractional bits.
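The range/resolution trade-off can be made concrete with a small sketch (illustrative Python; the lab itself uses Matlab's fi objects):

```python
def q_format_properties(n, m):
    """Range and resolution of a signed Qn.m number (1 sign bit, n integer bits, m fraction bits)."""
    resolution = 2.0 ** -m              # value of one LSB
    max_val = 2.0 ** n - resolution     # largest representable value
    min_val = -(2.0 ** n)               # most negative representable value
    return min_val, max_val, resolution

# Q0.15 (Q15): covers [-1, 1) with fine resolution 2^-15
print(q_format_properties(0, 15))   # (-1.0, 0.999969482421875, 3.0517578125e-05)
# Q3.12: wider range, coarser resolution
print(q_format_properties(3, 12))   # (-8.0, 7.999755859375, 0.000244140625)
```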
4 Round-off Error in Fixed Point Implementation
Fixed-point arithmetic is often used with DSP hardware for real-time processing because it offers fast operation
and relatively economical implementation. Its drawbacks include a smaller dynamic range and lower resolution. A
fixed-point representation also introduces arithmetic errors which, if not handled correctly, will lead to erroneous
operation. These errors have to be handled so as to strike a trade-off among accuracy, speed,
and cost.
Consider the multiplication of two binary fractions, as shown in Figure 4. From this calculation, we see that
full-precision multiplication almost doubles the number of bits. If we wish to return the product to a b-bit
representation, we must truncate the (b − 1) least significant bits; this introduces truncation error. A closely
related error is round-off error, where the product is rounded to the nearest b-bit fractional value
rather than truncated. Both types of error occur after the multiplication.
Another useful and important aspect that needs to be associated with a fixed-point object created using fi is
its behaviour during arithmetic operations. This is controlled by the fimath setting, which will be discussed
in the next section, and is specified as shown next:
>> fm = fimath;
>> x = 3.147;
>> xf = fi(x, 1, 16, 2, fm);
In this example, we are using the default settings of fimath, so this behaves similarly to the earlier case where
we did not specify fm when creating the fixed-point object xf. Note that you can access several of the
object's properties using the 'dot' form, as shown next.
>> xf.bin
or
>> xf.WordLength
will display the binary form of xf and other fixed-point properties. Try typing the tab key after placing a '.'
to see other properties that can be useful.
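As a rough Python analogue of what fi(x, 1, 16, 2) does (a sketch only; Matlab's default fimath also rounds to nearest and saturates on overflow):

```python
def quantize(x, signed=1, word_length=16, frac_length=2):
    """Quantize x like a fixed-point object: round to nearest, saturate on overflow."""
    lsb = 2.0 ** -frac_length
    if signed:
        lo, hi = -(2 ** (word_length - 1)), 2 ** (word_length - 1) - 1
    else:
        lo, hi = 0, 2 ** word_length - 1
    k = round(x / lsb)                  # stored integer
    k = max(lo, min(hi, k))             # saturate to the representable range
    return k * lsb                      # quantized real value

print(quantize(3.147))   # 3.25 -- the nearest multiple of 2^-2
```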
6 Lab Exercise
6.1 Computing y = a x + b
1. We will compare fixed-point computation and floating-point computation by considering the operation
y = a x + b. Go through the given Matlab file floating_sim.m and run it. This is a full-precision
computation using Matlab's default double-precision representation. Note down the range of values for each
of the variables of interest; these ranges will be used to identify the required fraction and integer
lengths in the fixed-point simulation.
2. Use fixed_sim_template.m to simulate the computation using fixed-point arithmetic. Choose fraction
lengths for the variables in the file so that your fixed-point result (yf) closely matches the floating-point
result (y) from the previous step. You need to complete the code with appropriate options to perform
valid fixed-point computations; the comments in the file can be used to complete the code. Compare
the floating-point result with the fixed-point result by plotting the error, or difference, between y and yf.
3. For this part use the file overflow_sim_template.m. The computation is similar to the previous parts, but
the result will overflow. Hence, appropriately set the fixed-point overflow mode in addition to making an
appropriate choice of fraction lengths. Compare the floating-point result with the fixed-point result by
plotting the error, or difference, between y and yf. This needs to be done for two different values of b,
as can be seen in the template file.
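The effect of the overflow mode can be sketched as follows (illustrative Python; in the lab the mode is set through Matlab's fimath, e.g. its OverflowAction property):

```python
def fit_to_int16(k, mode="saturate"):
    """Force an integer into 16-bit two's-complement range, saturating or wrapping."""
    lo, hi = -32768, 32767
    if mode == "saturate":
        return max(lo, min(hi, k))
    # "wrap": keep only the low 16 bits, reinterpreted as two's complement
    k &= 0xFFFF
    return k - 65536 if k > hi else k

print(fit_to_int16(40000, "saturate"))  # 32767 -- clipped to the largest value
print(fit_to_int16(40000, "wrap"))      # -25536 -- wraps around to a negative value
```

Wrapping is cheap in hardware but produces wildly wrong values on overflow; saturation clips to the nearest representable value, which is usually the less damaging behaviour for signals.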
2. Use the Set quantization parameters option in the FD to use fixed-point arithmetic. Use a word-length
(W) of 16 and a fraction-length (F) of 15. In the Input/Output options, set the word length to 16 and the input
range to ±1. All other options can be left as is. Design this quantized filter by pressing the Apply
button. You will see the frequency responses Lowpass Reference and Lowpass Quantized overlaid in
the frequency response panel. Export these coefficients as b_16.
3. Do as in the previous part, but use W = 8 and F = 7. Export these coefficients as b_8.
4. Now use W = 4 and F = 3. Export these coefficients as b_4.
5. Save all these coefficients in a MAT file, so that you can compare the responses.
>> save(’LPF_coefficients’, ’b_ref’, ’b_16’, ’b_8’, ’b_4’)
6. Demonstrate to your TA that the filters meet the specifications. Also indicate the difference in gain/attenuation
due to the use of quantized coefficients.
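The coefficient quantization itself can be sketched in a few lines (illustrative Python; b_ref below is a hypothetical stand-in for the exported reference coefficients, not the actual lab values):

```python
def quantize_coeffs(b, W, F):
    """Round each coefficient to the nearest multiple of 2^-F, saturating to W bits."""
    lsb = 2.0 ** -F
    hi = 2 ** (W - 1) - 1
    lo = -(2 ** (W - 1))
    return [max(lo, min(hi, round(c / lsb))) * lsb for c in b]

# hypothetical 5-tap lowpass coefficients, for illustration only
b_ref = [0.0625, 0.25, 0.375, 0.25, 0.0625]
for W, F in [(16, 15), (8, 7), (4, 3)]:
    bq = quantize_coeffs(b_ref, W, F)
    dc_gain = sum(bq)        # filter gain at 0 Hz
    print(W, dc_gain)
```

At small word lengths, coefficients near zero round away entirely, which is one source of the gain/attenuation differences you are asked to report.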
We will use the above filters to filter a signal that is the sum of two sinusoidal input signals, to verify the filter
performance. You can use the provided FIR_template.m to complete this part.
1. Use the FIR_template.m file to filter a signal x, which is the sum of two sinusoids x1 and x2. Signal x1 has
frequency f1 = 200 Hz and x2 has frequency f2 = 2000 Hz.
2. Filter the signal x through each of the filters designed above to obtain the outputs y_ref, y16, y8, and y4.
Verify that the output signals are as expected. Plot and compare the output signals and the error signals
(|y_ref - y16|, |y_ref - y8|, and |y_ref - y4|). Demonstrate this to your TA.
3. Compare the mean-square errors (MSE) of the outputs y16, y8, and y4 against the
double-precision output y_ref. While doing this, you can ignore the initial outputs, depending on the
filter length. Compare the three MSEs in dB and show these to your TA.
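The MSE-in-dB comparison can be sketched as follows (illustrative Python; y_ref and y16 below are made-up stand-ins for the lab's Matlab vectors, and skip stands for the filter transient length):

```python
import math

def mse_db(y_ref, y, skip=0):
    """Mean-square error between two sequences in dB, ignoring the first `skip` samples."""
    errs = [(a - b) ** 2 for a, b in zip(y_ref[skip:], y[skip:])]
    mse = sum(errs) / len(errs)
    return 10 * math.log10(mse)

y_ref = [0.0, 0.5, 1.0, 0.5, 0.0]
y16 = [0.0, 0.501, 0.999, 0.5, 0.001]
print(mse_db(y_ref, y16, skip=1))   # around -61 dB for this toy data
```

Larger (more negative) values in dB indicate a smaller error; you should see the MSE grow as the coefficient word length shrinks from 16 to 8 to 4 bits.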