ADI - User Manual 1
ADI - User Manual 1
ADI - User Manual 1
a
ADSP-21000 Family
Application Handbook Volume 1
1994 Analog Devices, Inc.
ALL RIGHTS RESERVED
PRODUCT AND DOCUMENTATION NOTICE: Analog Devices reserves the right to change this product
and its documentation without prior notice.
PRINTED IN U.S.A.
Printing History
FIRST EDITION 5/94
For marketing information or Applications Engineering assistance, contact your local
Analog Devices sales office or authorized distributor.
If you have suggestions for how the ADSP-2100 Family development tools or
documentation can better serve your needs, or you need Applications Engineering
assistance from Analog Devices, please contact:
Analog Devices, Inc.
DSP Applications Engineering
One Technology Way
Norwood, MA 02062-9106
Tel: (617) 461-3672
Fax: (617) 461-3010
e-mail: dsp_applications@analog.com
The System IC Products Division runs a Bulletin Board Service that can be reached at
speeds up to 14,400 baud, no parity, 8 bits data, 1 stop bit, dialing (617) 461-4258.
This BBS supports: V.32bis, error correction (V.42 and MNP classes 2, 3, and 4), and
data compression (V.42bis and MNP class 5)
The System IC Products Division Applications Group maintains an Internet FTP site. Login
as anonymous using your email address for your password. Type (from your UNIX
prompt):
For additional marketing information, call (617) 461-3881 in Norwood MA, USA.
Literature
ADSP-21000 FAMILY MANUALS
ADSP-21020 User’s Manual
ADSP-21000 SHARC Preliminary Users Manual
Complete description of processor architectures and system interfaces.
SPECIFICATION INFORMATION
ADSP-21020 Data Sheet
ADSP-2106 SHARC Preliminary Data Sheet
ADSP-21000 Family Development Tools Data Sheet
Contents
CHAPTER 1 INTRODUCTION
v
Contents
vi 2 – 2
Contents
2 – 3 vii
Contents
viii 2 – 4
Contents
2 – 5 ix
Contents
CHAPTER 8 GRAPHICS
x 2 – 6
Contents
FIGURES
2 – 7 xi
Contents
TABLES
LISTINGS
2 – 9 xiii
Contents
1
1 Introduction
1.2.2 Hardware Development Tools
Analog Devices offers several systems that let you test your programs on
real hardware without spending time hardware prototyping, as well as
help you debug your target system hardware.
1.2.2.1 EZ-LAB
EZ-LAB® evaluation boards are complete ADSP-210xx systems that
include memory, an audio codec, an analog interface, and expansion
connectors on a single, small printed-circuit board. Several programs are
included that demonstrate signal processing algorithms. You can
download your own programs to the EZ-LAB from your IBM-PC
compatible computer.
EZ-LAB connects with EZ-ICE (described in the next section) and an IBM-
PC compatible to form a high-speed, interactive DSP workstation that lets
you debug and execute your software without prototype hardware.
1.2.2.2 EZ-ICE
EZ-ICE® in-circuit emulators give you an affordable alternative to large
dedicated emulators without sacrificing features. The EZ-ICE software
runs on an IBM-PC and gives you a debugging environment very similar
to the ADSP-210xx simulator. The EZ-ICE probe connects to the PC with
an ISA plug-in card and to the target system through a test connector on
the target. EZ-ICE communicates to the target processor through the
processor’s JTAG test access port. Your software runs on your hardware at
full speed in real time, which simplifies hardware and software
debugging.
Several third party companies also provide products that support ADSP-
21000 family development; contact Analog Devices for a complete list.
Here are a few of the products available as of this writing:
2
Introduction 1
1.2.4 DSPatch
• The files on the DSP BBS are also available by anonymous ftp, at
ftp.analog.com (132.71.32.11) , in the directory /pub/dsp .
3
1 Introduction
1.2.6 ADSP-21000 Family Classes
Applications Engineering regularly offers a course in ADSP-21000 family
architecture and programming. Please contact Applications Engineering
for a schedule of upcoming courses.
1.3.2.1 Precision
4
Introduction 1
1.3.2.2 Dynamic Range
Traditionally, compression and decompression algorithms have operated
on signals of known bandwidth. These algorithms were developed to
behave regularly, to keep costs down and implementations easy.
Increasingly, the trend in algorithm development is to remove constraints
on the regularity and dynamic range of intermediate results. Adaptive
filtering and imaging are two applications requiring wide dynamic range.
1.3.2.4 Ease-Of-Use
Ideally, floating-point digital signal processors should be easier to use and
allow a quicker time-to-market than DSPs that do not support floating-
point formats. If the floating-point processor’s architecture is designed
properly, designers can spend time on algorithm development instead of
assembly coding, code paging, and error handling. The following features
are hallmarks of a good floating-point DSP architecture:
5
1 Introduction
1.3.3.1 Fast & Flexible Arithmetic
The ADSP-210xx can execute all instructions in a single cycle. It provides
one of the fastest cycle times available and the most complete set of
arithmetic operations, including Seed 1/X, Seed 1/R(x), Min, Max, Clip,
Shift and Rotate, in addition to the traditional multiplication, addition,
subtraction and combined addition/subtraction. It is IEEE floating-point
compatible and allows either interrupt on arithmetic exception or latched
status exception handling.
6
Introduction 1
1.4 ADSP-21000 FAMILY ARCHITECTURE OVERVIEW
The following sections summarize the basic features of the ADSP-21020
architecture. These features are also common to the ADSP-21060 SHARC
processor; SHARC-specific enhancements to the base architecture are
discussed in the next section.
• Instruction Cache
The ADSP-210xx includes a high performance instruction cache that
enables three-bus operation for fetching an instruction and two data
values. The cache is selective—only the instructions whose fetches
conflict with PM bus data accesses are cached. This allows full-speed
execution of looped operations such as digital filter multiply-
accumulates and FFT butterfly processing.
7
1 Introduction
8
Introduction 1
TIMER CACHE
32 x 48
JTAG
TEST &
EMULATION
DAG 1 DAG 2 PROGRAM
8 x 4 x 32 8 x 4 x 24 SEQUENCER
FLAGS
PM ADDRESS BUS 24
DM ADDRESS BUS 32
48
PM DATA BUS
Bus
Connect 40
DM DATA BUS
REGISTER
FILE
16 x 40
MULTIPLIER BARREL
ALU
SHIFTER
1×
CLOCK
4
Selects
PMTS DMTS
OE
PMPAGE DMPAGE WE
PMACK DMACK ACK PERIPHERALS
FLAG3-0
TIMEXP
RCOMP
ADDR
JTAG
BG
BR
DATA
4 5
9
1 Introduction
1.4.3 ADSP-21060 SHARC
The ADSP-21060 SHARC (Super Harvard Architecture Computer) is a
single-chip 32-bit computer optimized for signal computing applications.
The ADSP-21060 SHARC has the following key features:
DMA Controller
10
Trigonometric, Mathematical & 2
Transcendental Functions
Trigonometric
• sine/cosine approximation
• tangent approximation
• arctangent approximation
Mathematical
• square root
• square root with single precision
• inverse square root
• inverse square root with single precision
• division
Transcendental
• logarithm
• exponential
• power
15
2 Trigonometric, Mathematical &
Transcendental Functions
Let
|x| = Nπ + f
where
|f| ≤ π/2.
Then
2.1.1 Implementation
The two listings illustrate the sine approximation and the calling of the
sine approximation. The first listing, sin.asm , is an implementation of
the algorithm for calculation of sines and cosines. The second listing,
sinetest.asm , is an example of a program that calls the sine
approximation.
16
Trigonometric, Mathematical & 2
Transcendental Functions
Implementation of the sine algorithm on ADSP-21000 family processors is
straightforward. In the first listing below, sin.asm , two segments are
defined. The first segment, defined with the .SEGMENT directive, contains
the assembly code for the sine/cosine approximation. The second segment
is a data segment that contains the constants necessary to perform this
approximation.
There are two entry points in the subroutine, sin.asm . They are labeled
cosine and sine . Code execution begins at these labels. The calling
program uses these labels by executing the instruction
with the argument x in register F0. These calls are delayed branch calls
that efficiently use the instruction pipeline on the ADSP-21000 family. In a
delayed branch, the two instructions following the branch instruction are
executed prior to the branch. This prevents the need to flush an instruction
pipeline before taking a branch.
17
2 Trigonometric, Mathematical &
Transcendental Functions
2.1.2 Code Listings
2.1.2.1 Sine/Cosine Approximation Subroutine
/***************************************************************************
File Name
SIN.ASM
Version
0.03 7/4/90
Purpose
Subroutine to compute the Sine or Cosine values of a floating point input.
Equations Implemented
Y=SIN(X) or
Y=COS(X)
Calling Parameters
F0 = Input Value X=[6E-20, 6E20]
l_reg=0
Return Values
F0 = Sine (or Cosine) of input Y=[-1,1]
Registers Affected
F0, F2, F4, F7, F8, F12
i_reg
Cycle Count
38 Cycles
# PM Locations
34 words
# DM Locations
11 Words
***************************************************************************/
18
Trigonometric, Mathematical & 2
Transcendental Functions
#include “asm_glob.h”
.SEGMENT/PM Assembly_Library_Code_Space;
.PRECISION=MACHINE_PRECISION;
cosine:
i_reg=sine_data; /*Load pointer to data*/
F8=ABS F0; /*Use absolute value of input*/
F12=0.5; /*Used later after modulo*/
F2=1.57079632679489661923; /* and add PI/2*/
sine:
i_reg=sine_data; /*Load pointer to data*/
F7=1.0; /*Assume a positive sign*/
F12=0.0; /*Used later after modulo*/
F8=ABS F0, F2=mem(i_reg,1);
F0=PASS F0, F4=F8;
IF LT F7=-F7; /*If input was negative, invert
sign*/
compute_modulo:
F4=F4*F2; /*Compute fp modulo value*/
R2=FIX F4; /*Round nearest fractional portion*/
BTST R2 BY 0; /*Test for odd number*/
IF NOT SZ F7=-F7; /*Invert sign if odd modulo*/
F4=FLOAT R2; /*Return to fp*/
F4=F4-F12, F2=mem(i_reg,1); /*Add cos adjust if necessary,
F4=XN*/
compute_f:
F12=F2*F4, F2=mem(i_reg,1); /*Compute XN*C1*/
F2=F2*F4, F12=F8-F12; /*Compute |X|-XN*C1, and
XN*C2*/
F8=F12-F2, F4=mem(i_reg,1); /*Compute f=(|X|-XN*C1)-
XN*C2*/
F12=ABS F8; /*Need magnitude for test*/
F4=F12-F4, F12=F8; /*Check for sin(x)=x*/
IF LT JUMP compute_sign; /*Return with result in F1*/
compute_R:
F12=F12*F12, F4=mem(i_reg,1);
19
2 Trigonometric, Mathematical &
Transcendental Functions
LCNTR=6, DO compute_poly UNTIL LCE;
F4=F12*F4, F2=mem(i_reg,1); /*Compute sum*g*/
compute_poly:
F4=F2+F4; /*Compute sum=sum+next r*/
F4=F12*F4; /*Final multiply by g*/
RTS (DB), F4=F4*F8; /*Compute f*R*/
F12=F4+F8; /*Compute Result=f+f*R*/
compute_sign:
F0=F12*F7; /*Restore sign of result*/
RTS; /*This return only for sin(eps)=eps
path*/
.ENDSEG;
.SEGMENT/SPACE Assembly_Library_Data_Space;
.PRECISION=MEMORY_PRECISION;
.VAR sine_data[11] =
0.31830988618379067154, /*1/PI*/
3.14160156250000000000, /*C1, almost PI*/
-8.908910206761537356617E-6, /*C2, PI=C1+C2*/
9.536743164E-7, /*eps, sin(eps)=eps*/
-0.737066277507114174E-12, /*R7*/
0.160478446323816900E-9, /*R6*/
-0.250518708834705760E-7, /*R5*/
0.275573164212926457E-5, /*R4*/
-0.198412698232225068E-3, /*R3*/
0.833333333327592139E-2, /*R2*/
-0.166666666666659653; /*R1*/
.ENDSEG;
20
Trigonometric, Mathematical & 2
Transcendental Functions
Listing 2.1 sin.asm
File Name
SINTEST.ASM
Purpose
Example calling routine for the sine function.
*********************************************************************************/
#include “asm_glob.h”;
#include “def21020.h”;
#define N 4
#define PIE 3.141592654
.SEGMENT/PM
pm_rsti; /* The reset vector resides in this space */
DMWAIT=0x21; /* Set data memory waitstates to zero */
PMWAIT=0x21; /* Set program memory waitstates to zero */
JUMP start;
.ENDSEG;
.EXTERN sine;
.SEGMENT/PM pm_code;
start:
bit set mode2 0x10; nop; read cache 0; bit clr mode2 0x10;
M1=1;
B0=input;
L0=0;
I1=output;
L1=0;
end: 21
2 Trigonometric, Mathematical &
Transcendental Functions
IDLE;
.ENDSEG;
2.2.1 Implementation
The implementation of the tangent approximation algorithm uses 38
instruction cycles and consists of three logical steps.
Notice that in the argument reduction, the assembly instructions are all
multifunction instructions. ADSP-21000 family processors can execute a
data move or a register move in parallel with a computation. Because
multifunction instructions execute in a single cycle, the overhead for the
memory move is eliminated.
22
Trigonometric, Mathematical & 2
Transcendental Functions
A special case is if tan(x) = x. This occurs when the absolute value of f is
less than epsilon. This value is very close to 0. In this case, a jump is
executed and the tangent function is calculated using the values of f and 1
in the final divide.
The third step in the calculation of the tangent function is to divide P(g) by
Q(g). If the argument, f, is even, then the tangent function is
tan(f) = f * P(g)/Q(g)
tan(f) = –f * P(g)/Q(g)
23
2 Trigonometric, Mathematical &
Transcendental Functions
Similarly, the cotangent function is easily calculated by inverting the polynomial
File Name
TAN.ASM
Version
Version 0.03 7/5/90
Purpose
Subroutine to compute the tangent of a floating point input.
Equations Implemented
Y=TAN(X)
Calling Parameters
F0 = Input Value X=[6E-20, 6E20]
l_reg = 0 (usually L3)
Return Values
F0 = tangent of input X
Registers Affected
F0, F1, F2, F4, F7, F8, F12
i_reg (usually I3)
Cycle Count
38 Cycles
24
Trigonometric, Mathematical & 2
Transcendental Functions
# PM Locations
# DM Locations
*********************************************************************************/
#include “asm_glob.h”
.SEGMENT/PM Assembly_Library_Code_Space;
.PRECISION=MACHINE_PRECISION;
.GLOBAL tan;
tan: i_reg=tangent_data;
F8=PASS F0, F2=mem(i_reg,1); /* Use absolute value of input
*/
compute_modulo:
F4=F8*F2, F1=F0; /* Compute fp modulo value */
R2=FIX F4, F12=mem(i_reg,1); /* Rnd nearest fractional
portion */
F4=FLOAT R2, R0=R2; /* Return to fp */
compute_f:
F12=F12*F4, F2=mem(i_reg,1); /* Compute XN*C1 */
F2=F2*F4, F12=F8-F12; /* Compute X-XN*C1, and XN*C2
*/
F8=F12-F2, F4=mem(i_reg,1); /* Compute f=(X-XN*C1)-XN*C2
*/
F12=ABS F8, F7=F8;
F4=F12-F4, F12=mem(i_reg,1); /* Check for TAN(x)=x */
IF LT JUMP compute_quot; /* Compute quotient with NUM=f
DEN=1 */
compute_P:
F12=F8*F8, F4=mem(i_reg,1); /* g=f*f */
F4=F12*F4, F2=mem(i_reg,1); /* Compute p3*g */
F4=F2+F4; /* Compute (p3*g + p2) */
F4=F12*F4, F2=mem(i_reg,1); /* Compute (p3*g + p2)*g */
F4=F2+F4; /* Compute (p3*g + p2)*g + p1 */
F4=F12*F4; /* Compute
((p3*g + p2)*g + p1)*g */
F4=F4*F8; /* Compute
((p3*g + p2)*g +p1)*g*f */
F8=F4+F8, F4=mem(i_reg,1); /* Compute f*P(g) */
25
2 Trigonometric, Mathematical &
Transcendental Functions
compute_Q:
F4=F12*F4, F2=mem(i_reg,1); /* Compute sum*g */
F4=F2+F4; /* Compute sum=sum+next q */
F4=F12*F4, F2=mem(i_reg,1); /* Compute sum*g */
F4=F2+F4; /* Compute sum=sum+next q */
F4=F12*F4, F2=mem(i_reg,1); /* Compute sum*g */
F12=F2+F4, F7=F8; /* Compute sum=sum+next q */
compute_quot:
BTST R0 BY 0; /* Test LSB */
IF NOT SZ F12=-F7, F7=F12; /* SZ true if even value*/
F0=RECIPS F12; /* Get 4 bit seed R0=1/D */
F12=F0*F12, F11=mem(i_reg,1); /* D(prime) = D*R0 */
F7=F0*F7, F0=F11-F12; /* F0=R1=2-D(prime), F7=N*R0
*/
F12=F0*F12; /* F12=D(prime)=D(prime)*R1 */
F7=F0*F7, F0=F11-F12; /* F7=N*R0*R1, F0=R2=2-
D(prime) */
RTS (DB), F12=F0*F12; /* F12=D(prime)=D(prime)*R2 */
F7=F0*F7, F0=F11-F12; /* F7=N*R0*R1*R2,
F0=R3=2-D(prime)*/
F0=F0*F7; /* F7=N*R0*R1*R2*R3 */
.ENDSEG;
.SEGMENT/SPACE Assembly_Library_Data_Space;
.PRECISION=MEMORY_PRECISION;
.VAR tangent_data[13] = 0.6366197723675834308, /* 2/PI */
1.57080078125, /* C1, almost PI/2 */
-4.454455103380768678308E-6, /* C2, PI/2=C1+C2 */
9.536743164E-7, /* eps, TAN(eps)=eps */
1.0, /* Used in one path */
-0.7483634966612065149E-5, /* P3 */
0.2805918241169988906E-2, /* P2 */
-0.1282834704095743847, /* P1 */
26
Trigonometric, Mathematical & 2
Transcendental Functions
-0.2084480442203870948E-3, /* Q3 */
0.2334485282206872802E-1, /* Q2 */
-0.4616168037429048840, /* Q1 */
1.0, /* Q0 */
2.0; /* Used in divide */
.ENDSEG;
2.3.1 Implementation
This implementation of the tangent approximation algorithm uses 82
instruction cycles, in the worst case. It follows the three logical steps listed
above.
27
2 Trigonometric, Mathematical &
Transcendental Functions
H=atan(y/x)
arctan(x) = –arctan(–x)
where
f = (x – (R(3) – 1) / (R(3) + x)
R = g * P(g) / Q(g)
where
28
Trigonometric, Mathematical & 2
Transcendental Functions
and
Q(g) = (g + q1) * g + Q0
Notice that an eight-cycle macro, divide , is implemented for division. This macro is
used several times in the program.
The final step is to reconstruct the atan(x) from the atan(f) calculation.
File Name
ATAN.ASM
Version
Version 0.01 3/20/91
Purpose
Subroutine to compute the arctangent values of a floating point input.
Equations Implemented
atan- H=ATAN(Y)
atan2- H=ATAN(Y/X)
where H is in radians
Calling Parameters
F0 = Input Value Y=[6E-20, 6E20]
F1 = Input Value X=[6E-20, 6E20] (atan2 only)
l_reg=0
Return Values
F0 = ArcTangent of input
=[-pi/2,pi/2] for atan
=[-pi,pi] for atan2
Registers Affected
F0, F1, F2, F4, F7, F8, F11, F12, F15
i_reg
ms_reg
29
2 Trigonometric, Mathematical &
Transcendental Functions
Cycle Count
atan 61 Cycles maximum
atan2 82 cycles maximum
# PM Locations
# DM Locations
*********************************************************************************/
Register Usage:
q = f0-f15 Quotient
n = f4-f7 Numerator
d = f12-f15 Denominator
two = f8-f11 must have 2.0 pre-stored
tmp = f0-f3
Looping: none
Special Cases:
q may be any register, all others must be distinct.
Cycles: 8
Created: 3/19/91
--------------------------------------------------------------------------------
-----*/
#define DIVIDE(q,n,d,two,tmp) \
n=RECIPS d, tmp=n; /* Get 8 bit seed R0=1/D*/ \
d=n*d; /* D(prime) = D*R0*/ \
tmp=tmp*n, n=two-d; /* N=2-D(prime), TMP=N*R0*/ \
d=n*d; /* D=D(prime)=D(prime)*R1*/ \
tmp=tmp*n, n=two-d; /* TMP=N*R0*R1, N=R2=2-D(prime)*/ \
d=n*d; /* D=D(prime)=D(prime)*R2*/ \
tmp=tmp*n, n=two-d; /* TMP=N*R0*R1*R2, N=R3=2-D(prime)*/ \
q=tmp*n
30
Trigonometric, Mathematical & 2
Transcendental Functions
#include “asm_glob.h”
.SEGMENT/PM Assembly_Library_Code_Space;
.PRECISION=MACHINE_PRECISION;
atan2: i_reg=atan_data;
F11= 2.0;
F2= 0.0;
F1=PASS F1;
IF EQ JUMP denom_zero; /* if Denom. = 0, goto special
case*/
IF LT F2=mem(11,i_reg); /* if Denom. < 0, F2=pi (use at
end)*/
do_division: DIVIDE(F0,F7,F15,F11,F1);
JUMP re_entry (DB);
re_entry: F7 = 1.0;
COMP(F15,F7), F4=mem(0,i_reg); /* F4=2-sqrt(3)*/
IF LE JUMP tst_f; /* If input<=1, do arctan(input)*/
/* else do arctan(1/input)+const*/
DIVIDE(F15,F7,F15,F11,F1); /* do x=1/x*/
31
2 Trigonometric, Mathematical &
Transcendental Functions
tst_f: COMP(F15,F4); /* Note F4 prev. loaded from memory*/
IF LT JUMP tst_for_eps;
R10=R10+1, F4=mem(1,i_reg); /* F4=sqrt(3)*/
F12=F4*F15;
F7=F12-F7; /* F7=F12-1.0*/
F15=F4+F15;
DIVIDE(F15,F7,F15,F11,F1); /* = (sqrt(3)*x-1)/(x+sqrt(3))*/
32
Trigonometric, Mathematical & 2
Transcendental Functions
.ENDSEG;
.SEGMENT/SPACE Assembly_Library_Data_Space;
.PRECISION=MEMORY_PRECISION;
A square root exists (and can be calculated) for every non-negative floating-
point number. Calculating the square root of a negative number gives an
imaginary result. To calculate the square root of a negative number, take the
absolute value of the number and use sqrt(y) or isqrt(y) as defined in this
section. Remember that the result is really an imaginary number.
33
2 Trigonometric, Mathematical &
Transcendental Functions
Raphson iteration is then used for successively more accurate
approximations.
x = sqrt(y) = y * 1/sqrt(y)
For example, suppose that you need to calculate the square root of 30. If
you use 5.0 as an initial approximation of the square root, the inverse
square root approximation, 1/ 30, is 0.2. Therefore, given y=30 and
xn=0.2, one iteration yields
34
Trigonometric, Mathematical & 2
Transcendental Functions
Therefore the approximation of the inverse square root of 30 after three
iterations is 0.182574161. To calculate the square root, just multiply the
two numbers together
30 * 0.182574161 = 5.47722483
2.4.1 Implementation
To implement the Newton-Raphson iteration method for calculating an
inverse square root on an ADSP-21000 processor, the first task is to
calculate the initial low-accuracy approximation.
Note that the instruction RSQRTS only accepts inputs greater than zero.
A ±Zero returns ±Infinity, ±Infinity returns ±Zero, and a NAN (not-a-
number) or negative input returns an all 1's result. You can use
conditional logic to assure that the input value is greater than zero.
To calculate the seed for an input value stored in register F0, use the
following instruction:
35
2 Trigonometric, Mathematical &
Transcendental Functions
implemented as follows:
F12=F4*F4; /* F12=X0^2 */
F12=F12*F0; /* F12=C*X0^2 */
F4=F2*F4, F12=F8-F12; /* F4=.5*X0, F10=3-C*X0^2 */
F4=F4*F12; /* F4=X1=.5*X0(3-C*X0^2) */
The register F4 contains a reasonably accurate approximation for the inverse square
root. Successive iterations are made by repeating the above four lines or code. The
square root of F0 is calculated by multiplying the approximation F4, by the initial
input F0:
F0=F4*F0; /* X=sqrt(Y)=Y/sqrt(Y) */
There are four subroutine listings below that illustrate how to calculate sqrt(y) and
isqrt(y). Two are for single precision (24-bit mantissa), and two for extended precision
(32-bit mantissa).
36
Trigonometric, Mathematical & 2
Transcendental Functions
Version
Version 0.02 7/6/90
Purpose
Subroutine to compute the square root of x using the 1/sqrt(x) approximation.
Equations Implemented
X = sqrt(Y) = Y/sqrt(Y)
Calling Parameters
F0 = Y Input Value
F8 = 3.0
F1 = 0.5
Return Values
F0 = sqrt(Y)
Registers Affected
F0, F4, F12
Cycle Count
14 Cycles
# PM Locations
# DM Locations
***********************************************************************************/
#include “asm_glob.h”
37
2 Trigonometric, Mathematical &
Transcendental Functions
.SEGMENT/PM pm_code;
.PRECISION=MACHINE_PRECISION;
.GLOBAL sqrt;
sqrt:
F4=RSQRTS F0; /*Fetch seed*/
F12=F4*F4; /*F12=X0^2*/
F12=F12*F0; /*F12=C*X0^2*/
F4=F1*F4, F12=F8-F12; /*F4=.5*X0, F10=3-C*X0^2*/
F4=F4*F12; /*F4=X1=.5*X0(3-C*X0^2)*/
F12=F4*F4; /*F12=X1^2*/
F12=F12*F0; /*F12=C*X1^2*/
F4=F1*F4, F12=F8-F12; /*F4=.5*X1, F10=3-C*X1^2*/
F4=F4*F12; /*F4=X2=.5*X1(3-C*X1^2)*/
F12=F4*F4; /*F12=X2^2*/
F12=F12*F0; /*F12=C*X2^2*/
RTS (DB), F4=F1*F4, F12=F8-F12; /*F4=.5*X2, F10=3-C*X2^2*/
F4=F4*F12; /*F4=X3=.5*X2(3-C*X2^2)*/
F0=F4*F0; /*X=sqrt(Y)=Y/sqrt(Y)*/
.ENDSEG;
38
Trigonometric, Mathematical & 2
Transcendental Functions
Version
Version 0.02 7/6/90
Purpose
Subroutine to compute the inverse square root of x using the 1/
sqrt(x) approximation.
Equations Implemented
X = isqrt(Y) = 1/sqrt(Y)
Calling Parameters
F0 = Y Input Value
F8 = 3.0
F1 = 0.5
Return Values
F0 = isqrt(Y)
Registers Affected
F0, F4, F12
Cycle Count
13 Cycles
#PM Locations
#DM Locations
*********************************************************************************/
39
2 Trigonometric, Mathematical &
Transcendental Functions
#include “asm_glob.h”
.SEGMENT/PM pm_code;
.PRECISION=MACHINE_PRECISION;
.GLOBAL isqrt;
isqrt:
F4=RSQRTS F0; /*Fetch seed*/
F12=F4*F4; /*F12=X0^2*/
F12=F12*F0; /*F12=C*X0^2*/
F4=F1*F4, F12=F8-F12; /*F4=.5*X0, F10=3-C*X0^2*/
F4=F4*F12; /*F4=X1=.5*X0(3-C*X0^2)*/
F12=F4*F4; /*F12=X1^2*/
F12=F12*F0; /*F12=C*X1^2*/
F4=F1*F4, F12=F8-F12; /*F4=.5*X1, F10=3-C*X1^2*/
F4=F4*F12; /*F4=X2=.5*X1(3-C*X1^2)*/
F12=F4*F4; /*F12=X2^2*/
F12=F12*F0; /*F12=C*X2^2*/
RTS (DB), F4=F1*F4, F12=F8-F12; /*F4=.5*X2, F10=3-C*X2^2*/
F4=F4*F12; /*F4=X3=.5*X2(3-C*X2^2)*/
/* =isqrt(Y)=1/sqrt(Y)*/
.ENDSEG;
40
Trigonometric, Mathematical & 2
Transcendental Functions
Version
Version 0.02 7/6/90
Purpose
Subroutine to compute the square root of x to single-precision (24 bits
mantissa) using the 1/sqrt(x) approximation.
Equations Implemented
X = sqrtsgl(Y) = Y/sqrtsgl(Y)
Calling Parameters
F0 = Y Input Value
F8 = 3.0
F1 = 0.5
Return Values
F0 = sqrtsgl(Y)
Registers Affected
F0, F4, F12
Cycle Count
10 Cycles
41
2 Trigonometric, Mathematical &
Transcendental Functions
# PM Locations
# DM Locations
*********************************************************************************/
#include “asm_glob.h”
.SEGMENT/PM pm_code;
.PRECISION=MACHINE_PRECISION;
.GLOBAL sqrtsgl;
sqrtsgl:
F4=RSQRTS F0; /*Fetch seed*/
F12=F4*F4; /*F12=X0^2*/
F12=F12*F0; /*F12=C*X0^2*/
F4=F1*F4, F12=F8-F12; /*F4=.5*X0, F12=3-C*X0^2*/
F4=F4*F12; /*F4=X1=.5*X0(3-C*X0^2)*/
F12=F4*F4; /*F12=X1^2*/
F12=F12*F0; /*F12=C*X1^2*/
F4=F1*F4, F12=F8-F12; /*F4=.5*X1, F12=3-C*X1^2*/
F4=F4*F12; /*F4=X2=.5*X1(3-C*X1^2)*/
F0=F4*F0; /*X=sqrtsgl(Y)=Y/sqrtsgl(Y)*/
.ENDSEG;
42
Trigonometric, Mathematical & 2
Transcendental Functions
Version
Version 0.02 7/6/90
Purpose
Subroutine to compute the inverse square root of x to single-precision (24
bits mantissa) using the 1/sqrt(x) approximation.
Equations Implemented
X = isqrtsgl(Y) = 1/sqrtsgl(Y)
Calling Parameters
F0 = Y Input Value
F8 = 3.0
F1 = 0.5
Return Values
F0 = isqrtsgl(Y)
Registers Affected
F0, F4, F12
Cycle Count
9 Cycles
43
2 Trigonometric, Mathematical &
Transcendental Functions
# PM Locations
# DM Locations
*********************************************************************************/
#include “asm_glob.h”
.SEGMENT/PM pm_code;
.PRECISION=MACHINE_PRECISION;
.GLOBAL isqrtsgl;
isqrtsgl:
F4=RSQRTS F0; /*Fetch seed*/
F12=F4*F4; /*F12=X0^2*/
F12=F12*F0; /*F12=C*X0^2*/
F4=F1*F4, F12=F8-F12; /*F4=.5*X0, F12=3-C*X0^2*/
F4=F4*F12; /*F4=X1=.5*X0(3-C*X0^2)*/
F12=F4*F4; /*F12=X1^2*/
RTS(DB), F12=F12*F0; /*F12=C*X1^2*/
F4=F1*F4, F12=F8-F12; /*F4=.5*X1, F12=3-C*X1^2*/
F0=F4*F12; /*F4=X2=.5*X1(3-C*X1^2)*/
/* =isqrtsgl(Y)=1/sqrtsgl(Y)*/
.ENDSEG;
2.5 DIVISION
The ADSP-21000 family instruction set includes the RECIPS instruction
to simplify the implementation of floating-point division.
2.5.1 Implementation
The code performs floating-point division using an in iterative
convergence algorithm. The result is accurate to one LSB in whichever
format mode, 32-bit or 40-bit, is set (32-bit only for ADSP-21010). The
following inputs are required: F0 = numerator, F12= denominator,
F11 = 2.0. The quotients is returned in F0. (In the code listing, the two
highlighted instructions can be removed if only a ±1 LSB accurate single-
precision result is necessary.)
44
Trigonometric, Mathematical & 2
Transcendental Functions
unbiased Fx exponent, decremented by one; that is, if e is the unbiased exponent of
Fx, then the unbiased exponent of Fn = –e – 1. The sign of the seed is the sign of the
input. ±Zero returns ±Infinity and sets the overflow flag. If the unbiased exponent of
Fx is greater than +125, the result is ±Zero. A NAN input returns an all 1's result.
/*
File Name
F.ASM
Version
Version 0.03
Purpose
An implementation of division using an Iterative Convergent Divide
Algorithm.
Equations Implemented
Q = N/D
Calling Parameters
F0 = N Input Value
F12 = D Input Value
F11 = 2.0
Return Values
F0 = Quotient of input
Registers Affected
F0, F7, F12
Cycle Count
8 Cycles (6 Cycles for single precision)
# PM Locations
# DM Locations
45
2 Trigonometric, Mathematical &
Transcendental Functions
*/
#include “asm_glob.h”
.SEGMENT/PM Assembly_Library_Code_Space;
.PRECISION=MACHINE_PRECISION;
.GLOBAL divide;
F0=F0*F7; {F0=N*R0*R1*R2*R3}
.ENDSEG;
The algorithm can calculate logarithms of any base (base e, 2, or 10); the
46
Trigonometric, Mathematical & 2
Transcendental Functions
code is identical until step three, so only one assembly-language module is
needed.
The first step is to take a given floating point input, Y, and reduce it to
Y=f * 2N where 0.5 ≤f < 1. Given that X = log(Y), X also equals
X=loge(Y)
X=loge(f * 2N)
X=loge(f) + N*loge(2)
s = (f – 1) / (f + 1)
Then
X = loge(f) + N*loge(2)
yields the solution for the natural (base-e) log of the input Y. To compute
the base-2 or base-10 logarithm, the result X is multiplied by a constant
equal to the reciprocal of loge(2), or loge(10), respectively.
2.6.1 Implementation
LOGS.ASM is an assembly-language implementation of the logarithm
algorithm. This module has three entry points; a different base of
logarithm is computed depending on which entry point is used. The label
LOG is used for calling the algorithm to approximate the natural (base-e)
logarithm, while the labels LOG2 and LOG10 are used for base-2 and
base-10 approximations, respectively. When assembling the file
LOGS.ASM you can specify where coefficients are placed—either in data
memory (DM) or program memory (PM)— by using the -Didentifier
47
2 Trigonometric, Mathematical &
Transcendental Functions
switch at assembly time.
For example, to place the coefficients in data memory use the syntax
The first step to compute any of the desired logarithms is to reduce the
floating point input Y to the form
Y=f * 2N
For example, consider the floating point number 12.0. Using the IEEE
standard, this number is represented as 0.5 * 2131. By adding the hidden
one and unbiasing the exponent, 12.0 is actually represented as 1.5 * 24. To
get to the format f * 2N where 0.5 ≤ f < 1, you must scale 1.5 * 24 by two to
get the format 0.75 * 23. The instruction LOGB extracts the exponent from
our floating-point input. The exponent is then decremented, and the
mantissa is scaled to achieve the desired format.
Use the value of f to approximate the value of the auxiliary variable z. The
variable z is approximated using the following formula
z=znum/zden
where
and
48
Trigonometric, Mathematical & 2
Transcendental Functions
or
R(z) = z + z * r(z2)
The rational approximation r(z2) has been derived and for w=z2 is
r(z2) = w * A(w)/B(w)
where A(w) and B(w) are polynomials in w, with derived coefficients a0,
a1, and b0
A(w) = a1 * w + a0
B(w) = w + b0
R(z) is the approximation of loge(f) and the final step in the approximation
of loge(Y) is to add in N*loge(2). The coefficients C0 and C1 and the
exponent N are used to determine this value.
If only the natural logarithm is desired, then the algorithm is complete and
the natural log (ln(Y)) is returned in the register F0. If log2(Y) was needed,
then F0 is multiplied by 1/ln(2) or 1.442695041. If log10(Y) is needed, then
F0 is multiplied by 1/ln(10) or 0.43429448190325182765.
At assembly time, you must specify the memory space where the eight
coefficients are stored (Program or Data Memory ) by using either the
-DDM_DATA or -DPM_DATA switch. Attempts to assemble LOGS.ASM
without one of these switches result in an error.
49
2 Trigonometric, Mathematical &
Transcendental Functions
/***************************************************************
File Name
LOGS.ASM
Version
Version 0.03 8/6/90
revised 26-APR-91
Purpose
Subroutine to compute the logarithm (bases 2,e, and 10) of its floating
point input.
Equations Implemented
Y=LOG(X) or
Y=LOG2(X) or
Y=LOG10(X)
Calling Parameters
F0 = Input Value
l_reg=0;
Return Values
F0 = Logarithm of input
Registers Affected
F0, F1, F6, F7, F8, F10, F11, F12
i_reg
Computation Time
49 Cycles
#PM locations
#DM locations
***************************************************************/
#include “asm_glob.h”
.SEGMENT/PM Assembly_Library_Code_Space;
.PRECISION=MACHINE_PRECISION;
.GLOBAL log, log10, log2;
log2:
CALL logs_core (DB); /*Enter same routine in two cycles*/
R11=LOGB F0, F1=F0; /*Extract the exponent*/
F12=ABS F1; /*Get absolute value*/
RTS (DB);
F11=1.442695041; /*1/Log(2)*/
50
Trigonometric, Mathematical & 2
Transcendental Functions
F0=F11*F0; /*F0 = log2(X)*/
log10:
CALL logs_core (DB); /*Enter same routine in two cycles*/
R11=LOGB F0, F1=F0; /*Extract the exponent*/
F12=ABS F1; /*Get absolute value*/
RTS (DB);
F11=0.43429448190325182765; /*1/Log(10)*/
F0=F11*F0; /*F12 = log10(X)*/
log:
R11=LOGB F0, F1=F0;
F12=ABS F1;
adjust_z:
F7=F7-F10; /*znum = f - .5 - .5*/
F8=F12*F10; /*f * .5*/
F12=F8+F10; /*zden = f * .5 + .5*/
compute_r:
F0=RECIPS F12; /*Get 4 bit seed R0=1/D*/
F12=F0*F12, F10=mem(i_reg,1); /*D(prime) = D*R0*/
F7=F0*F7, F0=F10-F12; /*F0=R1=2-D(prime), F7=N*R0*/
F12=F0*F12; /*F12=D(prime)=D(prime)*R1*/
F7=F0*F7, F0=F10-F12; /*F7=N*R0*R1, F0=R2=2-D(prime)*/
F12=F0*F12; /*F12=D(prime)=D(prime)*R2*/
F7=F0*F7, F0=F10-F12; /*F7=N*R0*R1*R2, F0=R3=2-D(prime)*/
F6=F0*F7; /*F7=N*R0*R1*R2*R3*/
F0=F6*F6, F8=mem(i_reg,1); /*w = z^2*/
F12=F8+F0, F8=mem(i_reg,1); /*B(W) = w + b0*/
F7=F8*F0, F8=mem(i_reg,1); /* w*a1*/
F7=F7+F8, F8=F0; /*A(W) = w * a1 + a0*/
F0=RECIPS F12; /*Get 4 bit seed R0=1/D*/
F12=F0*F12; /*D(prime) = D*R0*/
F7=F0*F7, F0=F10-F12; /*F0=R1=2-D(prime), F7=N*R0*/
F12=F0*F12; /*F12=D(prime)=D(prime)*R1*/
F7=F0*F7, F0=F10-F12; /*F7=N*R0*R1, F0=R2=2-D(prime)*/
51
2 Trigonometric, Mathematical &
Transcendental Functions
F12=F0*F12; /*F12=D(prime)=D(prime)*R2*/
F7=F0*F7, F0=F10-F12; /*F7=N*R0*R1*R2, F0=R3=2-
D(prime)*/
F7=F0*F7; /*F7=N*R0*R1*R2*R3*/
F7=F7*F8; /*Compute r(z^2)=w*A(w)/B(w)*/
compute_R:
F7=F6*F7; /* z*r(z^2)*/
F12=F6+F7; /*R(z) = z + z * r(z^2)*/
F0=FLOAT R11, F7=mem(i_reg,1); /*F0=XN, F7=C2*/
F10=F0*F7, F7=mem(i_reg,1); /*F10=XN*C2, F7=C1*/
RTS (DB);
F7=F0*F7, F0=F10+F12; /*F0=XN*C2+R(z),
F7=XN*C1*/
F0=F0+F7; /*F0 = ln(X)*/
.ENDSEG;
.SEGMENT/SPACE Assembly_Library_Data_Space;
.VAR logs_data[8] = 0.70710678118654752440, /*C0 = sqrt(.5)*/
0.5, /*Constant used*/
2.0, /*Constant used*/
-5.578873750242, /*b0*/
0.1360095468621E-1, /*a1*/
-0.4649062303464, /*a0*/
-2.121944400546905827679E-4, /*C2*/
0.693359375; /*C1*/
.ENDSEG;
Then
exp(X) = exp(g) * CN
52
Trigonometric, Mathematical & 2
Transcendental Functions
where C = 2, and N is calculated as X/ln(C). Since the accuracy of the
approximation critically depends on the accuracy of g, the effective
precision of the ADSP-210xx processor must be extended during the
calculation of g. Use the equation
g = (X – XN * C1) – XN * C2
The second step is to compute the exponential for the reduced argument.
Since you have now calculated N and g, you must approximate exp(g).
Cody and Waite have derived coefficients (p1, q1, etc.) especially for the
approximation of exp(g)/2. The divide by two is added to counteract
wobbly precision. The approximation of exp(g)/2 is
where
The third step is to reconstruct the desired function from its components.
The components of exp(X) are exp(g)=R(g), C=2, and N=X/ln(2). Therefore:
exp(X) = exp(g) * CN
= R(g) * 2(N+1)
Note that N was incremented by one due to the scaling that occurred in
the approximation of exp(g).
2.7.1 Implementation
EXP.ASM is an implementation of the exponent approximation for the
ADSP-21000 family. When assembling the file EXP.ASM you can specify
where coefficients are placed—either in data memory (DM) or program
memory (PM)— by using the -Didentifier switch at assembly
time.
53
2 Trigonometric, Mathematical &
Transcendental Functions
For example, to place the coefficients in data memory use the syntax
The code includes another comparison for the case where the input
produces the output 1. This only occurs when the input is equal to
0.000000001 or 1.0E-9. If the input equals this value, the subroutine ends
and returns a one.
The first step in the approximation is to compute g and N for the equation
exp(X) = exp(g) * 2N
Where
N=X/ln(2)
g = (X - XN * C1) – XN * C2
C1 = 0.693359375
C2 = –2.1219444005469058277E-4
where
Once R(g), the approximation for exp(g), is calculated, the approximation for exp(x) is
derived by using the following equation:
exp(X) = 2(approx(exp(g)/2)) * 2N
= approx(exp(g)/2) * 2(N+1)
= R(g) * 2(N+1)
The SCALB instruction scales the floating point value of R(g) by the exponent N + 1.
/***************************************************************
File Name
EXP.ASM
55
2 Trigonometric, Mathematical &
Transcendental Functions
Version
Version 0.03 8/6/90
Modified 9/27/93
Purpose
Subroutine to compute the exponential of its floating point input
Equations Implemented
Y=EXP(X)
Calling Parameters
F0 = Input Value
l_reg=0;
Return Values
F0 = Exponential of input
Registers Affected
F0, F1, F4, F7, F8, F10, F12
i_reg
Computation Time
38 Cycles
# PM locations
46 words
#DM locations
12 words (could be placed in PM instead)
***************************************************************/
#include “asm_glob.h”
.SEGMENT/PM Assembly_Library_Code_Space;
.PRECISION=MACHINE_PRECISION;
.GLOBAL exponential;
output_too_large: RTS (DB);
F0=10000; /*Set some error here*/
F0=10000;
output_too_small: RTS (DB);
F0 = .000001;
F0 = .000001;
output_one: RTS (DB);
F0 = 1.0;
F0 = 1.0;
exponential: i_reg=exponential_data; /*Skip one cycle after this*/
F1=PASS F0; /*Copy into F1*/
F12=ABS F1, F10=mem(i_reg,1); /*Fetch maximum input*/
COMP(F1,F10), F10=mem(i_reg,1); /*Error if greater than max*/
IF GT JUMP output_too_large; /*Return XMAX with error*/
COMP(F1,F10), F10=mem(i_reg,1); /*Test for input to small*/
IF LT JUMP output_too_small; /*Return 0 with error*/
COMP(F12,F10), F10=mem(i_reg,1); /*Check for output 1*/
56
Trigonometric, Mathematical & 2
Transcendental Functions
IF LT JUMP output_one; /*Simply return 1*/
F12=F1*F10, F8=F1; /*Compute N = X/ln(C)*/
R4=FIX F12; /*Round to nearest*/
F4=FLOAT R4, F0=mem(i_reg,1); /*Back to floating point*/
compute_g: F12=F0*F4, F0=mem(i_reg,1); /*Compute
XN*C1*/
F0=F0*F4, F12=F8-F12; /*Compute |X|-XN*C1,
and XN*C2*/
F8=F12-F0, F0=mem(i_reg,1); /*Compute g=(|X|-
XN*C1)-XN*C2*/
compute_R: F10=F8*F8; /*Compute z=g*g*/
F7=F10*F0, F0=mem(i_reg,1); /*Compute p1*z*/
F7=F7+F0, F0=mem(i_reg,1); /*Compute p1*z + p0*/
F7=F8*F7; /*Compute g*P(z) =
(p1*z+p0)*g*/
F12=F0*F10, F0=mem(i_reg,1); /*Compute q2*z*/
F12=F0+F12, F8=mem(i_reg,1); /*Compute q2*z +
q1*/
F12=F10*F12; /*Compute (q2*z+q1)*z)*/
F12=F8+F12; /*Compute
Q(z)=(q2*z+q1)*z+q0*/
F12=F12-F7, F10=mem(i_reg,1); /*Compute Q(z) - g*P(z)*/
F0=RECIPS F12; /*Get 4 bit seed
R0=1/D*/
F12=F0*F12; /*D(prime) = D*R0*/
F7=F0*F7, F0=F10-F12; /*F0=R1=2-D(prime),
F7=N*R0*/
F12=F0*F12; /
*F12=D(prime)=D(prime)*R1*/
F7=F0*F7, F0=F10-F12; /*F7=N*R0*R1,
F0=R2=2-D(prime)*/
F12=F0*F12; /
*F12=D(prime)=D(prime)*R2*/
F7=F0*F7, F0=F10-F12; /*F7=N*R0*R1*R2,
F0=R3=2-D(prime)*/
F7=F0*F7; /*F7=N*R0*R1*R2*R3*/
F7=F7+F8; /*R(g) = .5 +
(g*P(z))/(Q(z)-
g*P(z))*/
R4=FIX F4; /*Get N in fixed point
again*/
RTS (DB);
R4=R4+1;
F0=SCALB F7 BY R4; /*R(g) * 2^(N+1)*/
.ENDSEG;
.SEGMENT/SPACE Assembly_Library_Data_Space;
.PRECISION=MEMORY_PRECISION;
.VAR exponential_data[12] = 88.0, /*BIGX */
-88.0, /*SMALLX*/
0.000000001, /*eps*/
1.4426950408889634074, /*1/ln(2.0)*/
0.693359375, /*C1*/
-2.1219444005469058277E-4, /*C2*/
0.59504254977591E-2, /*P1*/
0.24999999999992, /*P0*/
0.29729363682238E-3, /*Q2*/
57
2 Trigonometric, Mathematical &
Transcendental Functions
0.53567517645222E-1, /*Q1*/
0.5, /*Q0 and others*/
2.0; /*Used in divide*/
.ENDSEG;
x ** y = exp(y * ln(x))
V = V1 + V2
where
V1 = FLOAT(INTRND(V * 16))/16
and
V2 = V – V1
Z = 2W
58
Trigonometric, Mathematical & 2
Transcendental Functions
where
W = Y log2(X)
W=Y*U
X = f * 2m
where
1/2 ≤ f < 1
U1 = FLOAT(INTRND(U*16))/16
f = 2(–p/16) * g/a
a = precalculated coefficients
g = f * 2r = f (in the case of non-decimal processors r=0)
The reduced form of W is derived from the values of U1, U2, Y1, and Y2.
Since
W1 = m’ - p’/16
59
2 Trigonometric, Mathematical &
Transcendental Functions
2.8.1 Implementation
POW.ASM is an ADSP-21000 implementation of the power algorithm.
When assembling the file POW.ASM , you can specify where coefficients
are placed—either in data memory (DM) or program memory (PM)— by
using the -Didentifier switch at assembly time.
For example, to place the coefficients in data memory use the syntax
X = f * 2m
Use the LOGB function to recover the exponent of X. Since the floating
point format of the ADSP-210xx has a hidden scaling factor (see Appendix
D, “Numeric Formats,” in The ADSP-21020 User's Manual for details) you
must add a one to the exponent. This new exponent, m, scales X to
determine the value of f such that 1/2 ≤ f < 1.
f = X * 2 –m
set p = 1
if (g ≤ A1(9)), then p = 9
if (g ≤ A1(p+4)), then p = p + 4
if (g ≤ A1(p+2)), then p = p + 2
60
Trigonometric, Mathematical & 2
Transcendental Functions
Next, you must determine the values of U1 and U2. To determine U2, you
must implement a rational approximation. The equation for U2 is
U2 = (R + z * K) + z
z’ = 2 * [g - A1(p+1)] - A2 ((p+1)/2)
z = z’ + z’
To determine the value of R(z), Cody and Waite derived coefficients (p1, p2)
especially for this approximation. The equation is
where
v = z * z.
U1 = REDUCE(U)
U1 = FLOAT(INTRND(U * 16))/16
since
U = log2(X)
= log2(f * 2m)
= log2([2(-p/16) *g/a] * 2m)
= m - p/16
Therefore
U1 = FLOAT(INTRND(16*m-p)) * 0.0625
Y1 = REDUCE(Y)
61
2 Trigonometric, Mathematical &
Transcendental Functions
Y2 = Y – Y1
W = U2 * Y + U1 * Y2
W1 = REDUCE(W)
W2 = W - W1
W = W1 + U1 * Y1
W1 = REDUCE(W)
W2 = W2 + (W– W1)
W = REDUCE(W2)
IW1 = INT(16 * (W1 + W))
W2 = W2 – W
Now compare IW1 with the largest and smallest positive finite floating-
point numbers to test for overflow. If an overflow occurs, the subroutine
ends and an error value should be set.
For the next step IW2 must be less than or equal to zero. If W2 > 0, add one
to IW1 and subtract 1/16 from W2.
m’ = IW1/16 + 1
p’ = 16 * m’ – IW1
Z = W2 * Q(W2)
62
Trigonometric, Mathematical & 2
Transcendental Functions
Now, add 1 to Z and multiply by 2(–p’/16) using the equation
Z = (Z * A1(p’+1)) + A1(p1+1)
Version
Version 0.04 7/6/90
Purpose
Subroutine to compute x raise to the y power of its two floating point
inputs.
Equations Implemented
Y=POW(X)
Calling Parameters
F0 = X Input Value
F1 = Y Input Value
l_reg = 0
Return Values
F0 = Exponential of input
Registers Affected
F0, F1, F2, F4, F5, F7,
F8, F9, F10, F11, F12, F13, F14, F15
i_reg, ms_reg
Computation Time
37 Cycles
# PM locations
125 words
#DM locations
33 words (could be placed in PM instead)
***************************************************************/
#include “asm_glob.h”
#include “pow.h”
#define b_reg B3
63
2 Trigonometric, Mathematical &
Transcendental Functions
#define i_reg I3
#define l_reg L3
#define mem(i,m) DM(i,m)
#define m1_reg M7
#define mm_reg M6
#define ms_reg M5
#define SPACE DM
.SEGMENT/PM rst_svc;
jump pow;
.ENDSEG;
.SEGMENT/PM pm_sram;
.PRECISION=MACHINE_PRECISION;
.GLOBAL pow;
64
Trigonometric, Mathematical & 2
Transcendental Functions
R13=4;
COMP(F12,F15), F12=mem(2,i_reg); /*A1(p+4) - g*/
F11=mem(6,i_reg);
IF GE F12=F11;
IF GE R14=R14+R13; /*IF (g<=A1(p+4)) p=p+4*/
R13=2;
COMP(F12,F15); /*A1(p+2) - g*/
IF GE R14=R14+R13; /*IF (g<=A1(p+4)) p=p+2*/
determine_z: R14=R14+1, R4=R14;
ms_reg=R14;
i_reg=a1_values;
R11=ASHIFT R14 BY -1; /*Compute (p+1)/2*/
F12=mem(ms_reg,i_reg); /*Fetch A1(p+1)*/
ms_reg=R11;
i_reg=a2_values; /*Correction array*/
F0=F12+F15; /*g + A1(p+1)*/
F14=F15-F12, F11=mem(ms_reg,i_reg);
F7=F14-F11, F12=F0; /*[g-A1(p+1)]-A2((p+1)/2)*/
F11=2.0;
F7=F7+F7; /* z = z + z*/
determine_R: i_reg=power_array;
F8=F7*F7, F9=mem(p2,i_reg); /* v = z * z */
F10=F8*F9, F9=mem(p1,i_reg); /* p2*v */
F10=F10+F9; /* p2*v + p1 */
F10=F10*F8; /* (p2*v + p1) * v */
F10=F10*F7, F9=mem(K,i_reg); /* R(z) = (p2*v+p1)*v*z */
determine_u2: F11=F10*F9; /* K*R */
F11=F10+F11; /* K + K*R */
F9=F9*F7; /* z*K */
65
2 Trigonometric, Mathematical &
Transcendental Functions
F9=F9+F11; /* R + z*K */
F9=F9+F7; /* (R + z*K) + z*/
determine_u1: R3=16;
R2=R2*R3 (SSI); /* m*16 */
R2=R2-R4; /* m*16-p */
R3=-4;
F2=FLOAT R2 BY R3; /*FLOAT(m*16-p)*.0625*/
determine_w: R4=4; /*Used in reduce*/
R8=FIX F1 BY R4;
F8=FLOAT R8 BY R3; /* y1=REDUCE(Y) */
F7=F1-F8; /* y2 = y - y1 */
F15=F9*F1; /* U2*Y */
F14=F2*F7; /* U1*Y2 */
F15=F14+F15; /* W = U2*Y + U1*Y2*/
R14=FIX F15 BY R4;
F14=FLOAT R14 BY R3; /* W1=REDUCE(W) */
F13=F15-F14; /* W2 = W - W1 */
F12=F2*F8; /* U1*Y1 */
F12=F14+F12; /* W = W1 + U1*Y1 */
R14=FIX F12 BY R4;
F14=FLOAT R14 BY R3; /* W1 = REDUCE(W) */
F11=F12-F14; /* W-W1 */
F13=F11+F13; /* W2 = W2 + (W-W1) */
R12=FIX F13 BY R4;
F12=FLOAT R12 BY R3; /* W = REDUCE(W2) */
F10=F12+F14;
R10=FIX F10 BY R4; /* IW1 = INT(16*(W1+W)) */
F13=F13-F12, R9=mem(bigx,i_reg); /* W2 = W2 - W */
COMP(R10,R9), R9=mem(smallx,i_reg); /* Test for overflow */
IF GE JUMP overflow
COMP(R10,R9);
IF LE JUMP underflow;
flow_to_a: F13=PASS F13; /* W2 must be <=0 */
IF LE JUMP determine_mp;
F8=.0625;
F13=F13-F8;
R10=R10+1;
determine_mp: R8=1;
R10=PASS R10;
IF LT R8=R8-R8; /* I=0 if IWI < 0 */
R6=ABS R10; /* Take ABS for shift*/
R7=ASHIFT R6 BY -4; /* IW1/16 */
R6=PASS R10;
IF LT R7=-R7;
R7=R7+R8; /* m(prime) = IW1/16 + I */
R6=ASHIFT R7 BY 4; /* m(prime)*16 */
R6=R6-R10, F5=mem(q5,i_reg); /* p(prime) = 16*m(prime) - IW1 */
determine_Z: F4=F5*F13, F5=mem(q4,i_reg); /* q5*W2 */
F4=F4+F5, F5=mem(q3,i_reg); /* q5*W2 + q4 */
66
Trigonometric, Mathematical & 2
Transcendental Functions
F4=F4*F13; /* (q5*W2+q4)*W2 */
F4=F4+F5, F5=mem(q2,i_reg); /* (q5*W2+q4)*W2+q3 */
F4=F4*F13; /* ((q5*W2+q4)*W2+q3)*W2 */
F4=F4+F5, F5=mem(q1,i_reg); /* ((q5*W2+q4)*W2+q3)*W2+q2 */
F4=F4*F13; /* (((q5*W2+q4)*W2+q3)*W2+q2)*W2 */
F4=F4+F5; /*
Q(W2)=(((q5*W2+q4)*W2+q3)*W2+q2)*W2+q1*/
F4=F4*F13; /* Z = W2*Q(W2) */
i_reg=a1_values;
R6=R6+1; /* Compute p(prime)+1 */
ms_reg=R6;
F5=mem(ms_reg,i_reg); /* Fetch A1(p(prime)+1) */
F4=F4*F5; /* A1(p(prime)+1)*Z */
F4=F4+F5; /* Z = A1(p(prime)+1) +
A1(p(prime)+1)*Z */
F0=SCALB F4 BY R7; /* Result = ADX(Z,m(prime)) */
underflow: F0=0; /*Set error also*/
RTS;
67
2 Trigonometric, Mathematical &
Transcendental Functions
.ENDSEG;
.SEGMENT/DM dm_sram;
.VAR a1_values[18] = 0,
0x3F800000,
0x3F75257D,
0x3F6AC0C6,
0x3F60CCDE,
0x3F5744FC,
0x3F4E248C,
0x3F45672A ,
0x3F3D08A3 ,
0x3F3504F3 ,
0x3F2D583E ,
0x3F25FED6 ,
0x3F1EF532 ,
0x3F1837F0 ,
0x3F11C3D3 ,
0x3F0B95C1 ,
0x3F05AAC3 ,
0x3F000000;
.VAR a2_values[9] = 0,
0x31A92436,
0x336C2A95,
0x31A8FC24,
0x331F580C,
0x336A42A1,
0x32C12342,
0x32E75624,
0x32CF9891;
68
Trigonometric, Mathematical & 2
Transcendental Functions
0.961620659583789E-2, /* q4 */
0.130525515942810E-2, /* q5 */
0.44269504088896340736, /* K */
2032., /* bigx */
-2015; /* smallx */
.ENDSEG;
#ifdef PM_DATA
#endif
#ifdef DM_DATA
#define b_reg B3
#define i_reg I3
#define l_reg L3
#define mem(i,m) DM(i,m)
#define m1_reg M7
#define mm_reg M6
#define ms_reg M5
#define SPACE DM
#define Assembly_Library_Data_Space Lib_DMD
#endif
#define p1 0
69
#define p2 1
#define q1 2
#define q2 3
#define q3 4
#define q4 5
Matrix Functions 3
71
3 Matrix Functions
3.1 STORING A MATRIX
To minimize index register usage, the two-dimensional array matrix is
stored in a single-dimensional array buffer and element positions are kept
track of by the processor’s DAG registers. The program can access any
matrix element by using the index, modify, and length registers. The
elements are stored in row major order; all the elements of the first row are
first in the buffer, then all of the elements in the second row, and so forth.
To read the elements in a row consecutively, set the index pointer to the
location of the first element in the row and set the modify value to 1.
For example, to read the first column in the 3 ×3 matrix example, set the
index pointer to point to A11 and the set modify value to three (3). After an
indirect memory read of the first element, the index pointer points to A21,
the second column element.
This method of storing and accessing a matrix keeps available more DAG
registers than using a different index pointer for each row of the matrix.
72
Matrix Functions 3
3.2 MULTIPLICATION OF A M×N MATRIX BY AN N×1 VECTOR
This section discusses how to multiply a two-dimensional matrix of
arbitrary size, M×N, by a vector (one-dimensional matrix), N×1.
3.2.1 Implementation
The matrix elements are stored in the buffer mat_a , with length of M×N
(where M is the number of rows in a matrix and N is the number of
columns).
The M×N matrix and M×1 result vector are stored in a data memory
segment, while the N×1 multiply vector is stored in a program memory
segment. This lets the algorithm take advantage of the dual fetch of the
multifunction instructions to access the next two elements of the matrix
and vector while multiplying the current two elements.
The outer loop, column , takes the final accumulated result and stores it
in the mat_c result buffer. The loop also clears the accumulator result,
R8, so that it can be used again for the next row loop. The column loop
is performed M times (the number of rows ) to account for the number of
times it has to perform the row loop operations.
73
3 Matrix Functions
When performing the accumulation on the last elements of the last row,
the index pointers of the input buffers wrap around to the start of the
buffer; the multiplication and data fetches for the first row are repeated.
Since those operations are redundant, their destination registers can be
written over after the routine completes. Note that the index pointers are
also modified and point to the third elements in the matrices when the
routine is finished. Therefore, the pointers must be restored if the same
matrices must be used in a subsequent routine.
74
Matrix Functions 3
3.2.2 Code Listing–M×N By N×1 Multiplication
/*******************************************************************************
File Name
MxNxNx1.ASM
Version
25-APR-91
Purpose
Matrix times a Vector.
Equations Implemented
[Mx1]=A[MxN]*B[Nx1]
Calling Parameters
Constants: m, n
pm(mat_b[n]) row major, dm(mat_a[m*n]) row major,
M1=1;
M9=1;
B0=mat_a; L0=@mat_a;
B1=mat_c; L1=0;
B8=mat_b; L8=@mat_b;
Return Values
dm(mat_c[m]) row major
Registers Affected
F0,F4,F8,F12, I0,I1,I8
Cycle Count
cycles=6+M(3+N)+5 (entrance + core + 5 cache)
# PM Locations
pm code=8 words, pm data=n words
# DM Locations
dm data=m*n+m words
*********************************************************************************/
75
3 Matrix Functions
/* dimension constants */
#define M 4
#define N 4
#ifndef example
.GLOBAL mxnxnx1;
.EXTERN mat_a, mat_b,mat_c;
#endif
#ifdef example
.SEGMENT/DM dm_data;
.VAR mat_a[M*N]=”mat_a.dat”;
.VAR mat_c[M];
.ENDSEG;
.SEGMENT/PM pm_data;
.VAR mat_b[N]=”mat_bb.dat”;
.ENDSEG;
.SEGMENT/PM rst_svc;
dmwait=0x21; /* set dm waitstates to zero */
pmwait=0x21; /* set pm waitstates to zero */
jump setup;
.ENDSEG;
76
Matrix Functions 3
3.3.1 Implementation
The two input matrices are stored in separate data memory and program
memory segments so multifunction instructions can be used to produce
efficient code.
Repeating this code, however, requires that the modifications of the index
pointers for each matrix be handled differently. The modify value for
mat_a (m1) is still set to 1 to increment through the matrix row by row.
However, to consecutively increment through the columns of mat_b , the
modify value (m10) needs to be set to the number of columns in that
matrix. For this code example, the second matrix is a 4 × 4 matrix, so the
modify value is 4. The result matrix is written to column by column and
has the same modify value.
After a single loop through the matrix-vector code, a column in the result
matrix is complete. Because the code is looped to reduce the number of
cycles inside the loop, the position of the index pointers are incorrect to
perform the next matrix-vector multiplication. Modify instructions are
included at the end of the rowcol loop that modify the index pointers so
they begin at the correct position. The index register for mat_a is
modified to point to the beginning of the first row and first column. The
index registers (pointers) for both mat_b and the result matrix mat_c
are modified to point to the beginning of the next column.
77
3 Matrix Functions
The pointer to the matrix mat_a is modified to point to the first element
of the first row by performing a dummy fetch with a modify value of –2.
(Because the loop is rolled, the DAG fetches the first two row elements
again.)
A dummy fetch is also used to modify the matrix pointer for mat_b to
point to the first element of the next column. The modify value for this
dummy fetch is –(O * 2 – 1), where O is the number of columns in the
matrix. Because of loop rolling, the index pointer points to the third
column element. Therefore, the index pointer needs to be adjusted
backwards by two rows minus one element. For this code example, the
index pointer points to the third column position. To make the pointer
point to the first element of the next column, modify the pointer by
–(4 * 2 – 1) = –7.
Finally, the pointer to the result matrix mat_c is modified to point to the
first element of the next column by modifying the index pointer by one.
Since the instruction that writes the result is the last one in the loop, loop
rolling does not affect this index pointer. Dummy reads modify the index
registers, instead of modify instructions, so that a multifunction
instruction can be performed. This reduces the number of cycles in the
loop.
The loop colrow is then executed again, which calculates the result for
the next column in the result matrix. The loop colrow repeats until all
the columns in the result matrix are filled.
To use the same matrices in a subsequent routine, first reset the index
pointers to the beginning of each buffer.
78
Matrix Functions 3
File Name
MxNxNxO.ASM
Version
25-APR-91
Purpose
Matrix times a Matrix.
Equations Implemented
C[MxO]=A[MxN]*B[NxO]
Calling Parameters
Constants: m, n, o
pm(mat_b[N*O]) row major, dm(mat_a[M*N]) row major
dm(mat_c[M*O]) row major
M1=1;
M2=-2;
M3=o;
M9=-(o*2-1);
M10=o;
B0=mat_a; L0=@mat_a;
B1=mat_c; L1=@mat_c;
B8=mat_b; L8=@mat_b;
Return Values
F0,F4,F8,F12, I0,I9, B8
Registers Affected
F0,F4,F8,F12, I0,I1,I8
Cycle Count
cycles=4+o(m(n+2)+5)+7 (entrance + core + 7 cache)
# PM Locations
pm code=11 words, pm data=NxO words,
# DM Locations
dm data=MxN+MxO words
(listing continues on next page)
*********************************************************************************/
79
3 Matrix Functions
/* dimension constants */
#define M 4
#define N 4
#define O 4
#ifndef example
.GLOBAL mxnxnxo;
.EXTERN mat_a, mat_b, mat_c;
#endif
#ifdef example
.SEGMENT/DM dm_data;
.VAR mat_a[M*N]=”mat_a.dat”;
.VAR mat_c[M*O];
.ENDSEG;
.SEGMENT/PM pm_data;
.VAR mat_b[N*O]=”mat_b.dat”;
.ENDSEG;
Ax = b
The vector x contains the unknowns, the matrix A contains the set of
coefficients, and the vector b contains the solutions of the linear equations.
The matrix A must be a non-singular square matrix. To get the solution for
x, multiply the inverse of matrix A by the constant vector, b. The inverse
matrix is useful if a different constant vector b is used with the same
equations. The same inverse can be used to solve for the new solutions.
• The first section searches for the largest element of the matrix.
• The second section places that element on the diagonal of the matrix
making it the pivot element. The row that contained the pivot element
is marked so that it won’t be used again.
• The third section of code divides the rest of the row by that pivot
element.
81
3 Matrix Functions
• The first four sections are repeated until all of the rows of the matrix
have been checked for a pivot point.
• The fifth section performs the corrections for swapping rows and
columns.
3.4.1 Implementation
The matrix is in data memory and uses N×N locations, where N is the size
of the matrix (N=3 for a 3×3 matrix). The pivot flag (pf) and swap column
(swc) arrays are also stored in data memory. The swap row array (swr) is
in program memory.
This routine uses all of the universal registers (F0-F15) and eight of the
DAG registers (I0-I8) as either data storage or temporary registers.
Therefore, if this routine is called from a routine that uses these registers,
switch to the secondary registers before starting the routine. Make sure to
switch back to the primary registers when done.
The first four sections of the algorithm are enclosed in the full_pivot
loop. Each pass through the loop searches for the largest element in the
matrix, places that element on the diagonal, and performs the in-place
elimination. The loop repeats for every row of the matrix.
The first section of the algorithm searches the entire matrix for the largest
magnitude pivot value in the nested loops row_big and column_big .
At the beginning of each loop, it checks if the pivot flag is set for that row
or column. If the pivot flag is set, that row or column contains a pivot
point that has already been used. Since any element in a row or column
that previously contained a pivot point cannot be reused, the loop is
skipped and the index pointer is modified to point to the next row or
column.
The loop performs a comparison of all the elements in the matrix and the
largest value is stored in register F12—the pivot element. The row that
contained the pivot point is stored in the pf buffer so that any elements
in that row will not be used again. A test is performed to see if the pivot
element is a zero. If the pivot point is zero, the matrix is singular and does
not have a realizable inverse. The routine returns from the subroutine and
the error flag is register F12 containing a zero.
82
Matrix Functions 3
The second section of the algorithm checks if the pivot element is on the
diagonal of the matrix. If the pivot element is not on the diagonal,
corresponding rows and columns are swapped to place the element on the
diagonal. The position of the pivot element is stored in the counters R2
and R10. If these two numbers are equal, then the element is already on a
diagonal and the algorithm skips to the next section of the algorithm. If
the numbers are not equal, the loop swap_row is performed. This loop
swaps the corresponding row and column to place the pivot element on
the diagonal of the matrix. The row and column numbers that were
swapped are stored in separate arrays called swr (row) and swc
(column). These values will be used in the fifth section to correct for any
swapping that has occurred.
The third section of the algorithm divides all the elements of the row
containing the pivot point by the pivot point. The inverse of the pivot
point is found with the macro DIVIDE . The result of the macro is stored
in the f1 register. The other elements in the row are then multiplied by the
result in the loop divide_row .
The fourth section of the algorithm performs the in place elimination. The
elimination process occurs within the two loops fix_row and
fix_column . The results of the elimination replace the original elements
of the matrix.
The fifth section of the algorithm is executed after the entire matrix is
reduced. This section fixes the matrix if any row and column swapping
was done. The algorithm reads the values stored in the arrays swr and
swc and swaps the appropriate columns if the values are not zero.
83
3 Matrix Functions
File Name
MATINV.ASM
Version
May 6 1991
Purpose
Inverts a square matrix using the Gauss-Jordan elimination.
algorithm with full pivoting.
Equations Implemented
C[MxO]=A[MxN]*B[NxO]
Calling Parameters
dm(mat_a[n*n]) row major, dm(pf[n+1]), dm(swc[n]);
pm(swr[n]);
r14=n; (n= number of rows (columns))
m0=1; m1=-1;
m8=1; m9=-1;
b0=mat_a;
b1=pf;
b7=swc; l7=0;
b8=swr; l8=0;
Return Values
dm(mat_a[n*n]) row major;
f12=0.0 -> matrix is singular
Registers Affected
f0 - f15,
i0 - i7, i8, m2
Cycle Count
maximum number= worst case= 7.5n**3+25n**2+25.5n+23 (approximated)
# PM Locations
pm code= 93 words, pm data= n words
# DM Locations
dm data= n*n+2n+1 words
84
Matrix Functions 3
*********************************************************************************/
#include “macros.h”
#define n 3
#ifndef example
.GLOBAL mat_inv;
.EXTERN mat_a;
#endif
#ifdef example
.SEGMENT/DM dm_data;
.VAR mat_a[n*n]= “mat_a1.dat”;
.VAR pf[n+1];
.VAR swc[n];
.ENDSEG;
.SEGMENT/PM rst_svc;
dmwait=0x21; /*set dm waitstates to zero*/
pmwait=0x21; /*set pm waitstates to zero*/
jump setup;
.ENDSEG;
.SEGMENT/PM pm_data;
.VAR swr[n];
.ENDSEG;
/* example calling code */
.SEGMENT/PM pm_code;
setup: b0=mat_a; /*i0 -> a(row,col)*/
b1=pf; /*i1 -> pf= pivot_flag*/
b7=swc; /*i7 -> swc= swap_col*/
b8=swr; /*i8 -> swr= swap_row*/
l7=0;
l8=0;
m0=1;
m1=-1;
m8=1;
m9=-1;
r14=n;
call mat_inv;
85
3 Matrix Functions
idle;
.ENDSEG;
#endif
86
Matrix Functions 3
modify(i4,m7); /*i4 -> a(icol,col)*/
i5=i4;
lcntr=r14, do swap_row until lce;
f4=dm(i3,0); /*f4= temp= a(irow,col)*/
f0=dm(i5,0); /*f0= a(icol,col)*/
dm(i3,1)=f0; /*a(irow,col)= a(icol,col)*/
swap_row: dm(i5,1)=f4; /*a(icol,col)= temp*/
87
3 Matrix Functions
3.5 REFERENCES
[EMBREE91] Embree, P. and B. Kimble. 1991. C Language Algorithms
For Digital Signal Processing. Englewood Cliffs, NJ:
Prentice Hall.
88
FIR & IIR Filters 4
• Audio processing
• Speech compression
• Modems
• Motor control
• DSP filters are programmable. The transfer function of the filter can be
changed by changing coefficients in memory. A single hardware
design can implement many different filters, by merely changing the
software.
89
4 FIR & IIR Filters
This chapter presents two classes of digital filters and their software
implementations on ADSP-21000 family processors. The first section of
this chapter discusses the Finite Impulse Response (FIR) filter and the
second section discusses the Infinite Impulse Response (IIR) filter.
An FIR filter is a weighted sum of a finite set of inputs. The equation for
an FIR filter is
m–1
y(n)= ∑ akx(n – k)
k=0
where
∞
y(n)= ∑ h(k) x(n – k)
k=0
90
FIR & IIR Filters 4
FIR filters have several advantages that make them more desirable than
IIR filters for certain design criteria
• FIR filters are always stable because they are made up solely of zeros
in the complex plane.
• FIR filters are easy to understand and implement. They can provide
quick solutions to engineering problems.
4.1.1 Implementation
The FIR filter program is presented in this section as an example, but it
easily can be modified and used in a real world system. It is helpful to
read the actual program listing along with the following description.
The FIR filtering code uses predefined input samples stored in a buffer
and coefficients that are stored in a data file. The code executes a five-tap
filter on nine predefined samples. After the program processes the ninth
sample it enters an infinite loop.
The code for the FIR filter consists of two assembly-language modules,
firtest.asm and fir.asm . The firtest.asm module sets up
buffers and pointers and calls fir.asm , the module that performs the
multiply-accumulate operations. Two other files are needed: an
architecture file and a data file. The file generic.ach is the architecture
file, which defines the hardware in terms of memory spaces and
peripherals. The fircoefs.dat file contains a list of filter coefficients
for the filter.
• TAPS is the number of taps of the filter. To change the number of taps,
change the value of TAPS and add or delete coefficients to the file
FIRCOEFS.DAT .
91
4 FIR & IIR Filters
• The buffer DLINE , in data memory, contains the delay line, a circular
buffer of past samples. The COEFS and DLINE buffers are used in
the fir.asm module and hold the values that are multiplied and
accumulated.
92
FIR & IIR Filters 4
After the buffer is associated with data address registers, the code
initializes the delay line. At the label FIR_INIT a loop that puts zeros in
the delay line buffer starts. This initialization loop is necessary because
without it the contents of the buffer would be undefined until five samples
are accessed.
• A value from the input buffer is placed in F0. This is how the input
sample is passed to the FIR.ASM module.
• R1 is set to a value that is the amount of times that the loop in the FIR
module has to repeat. As shown by the FIR equation, R1 is one less
than the number of multiplies that has to be executed for a given value
of n.
The sum of products operations, which comprise the core loop of the FIR
filter, are executed by FIR.ASM . The code starts by zeroing some registers
and loading data from buffers into other registers. At the line labeled FIR
, a trick is used to zero the R12 register and fetch an input sample to the
delay line buffer in a single cycle: use an XOR to zero R12 instead of
using an immediate move. The XOR is an ALU compute operation. The
ADSP-21000 family allows for a computation and a data movement to be
executed in the same instruction. If an immediate move instruction was
used to set R12 to zero, the data move could not be executed in the same
instruction.
Note that the pointer is post modifed when the input sample moves to the
delay line buffer. Because of this, the coefficients array must be arranged
so that the coefficient associated with K = max is first in the array. The
modify instruction moves the delay line pointer to the oldest value in the
delay line. The input sample just written to the delay line buffer is not
accessed until the rest of the buffer is accessed. See Figure 4.1.
93
4 FIR & IIR Filters
Dline Coefs
K=O
newest sample
K=4 I8
K=4
IO K=3
oldest sample
K=3 K=2
K=2 K=1
K=1 K=0
Position of pointers for delay line buffer and coefficent buffer after a new sample has just been
added to the delay line. Note the order of the coefficients. It refers to the equation in section 4.4.
The MACS loop, which computes the sum of products and executes
TAPS – 1 number of times, efficiently uses the multifunction capabilities of
the ADSP-21000 architecture. A multifunction instruction can fetch two
operands and perform a multiplication and an addition operation in a
single cycle.
• The operands for the addition are the results of the multiplication;
valid operands for the addition are not generated until the loop
executes twice. For the first two iterations of the loop, the code uses
“dummy” operands of zero. The third time through the loop and after,
the multiplication generates two valid operands for the addition.
The code and memory usage for this algorithm is determined by the
number of taps the filter has. The code in FIR.ASM is common to FIR
filter implementations. The code in FIRTEST.ASM varies depending on
how coefficients and data are input into the filter, and therefore is not
included in the benchmark.
94
FIR & IIR Filters 4
95
4 FIR & IIR Filters
4.1.2 Code Listings
4.1.2.1 Example Calling Routine
/***************************************************************************
File Name
FIRTEST.ASM
Version
.03 11/2/93
Purpose
This program is a shell program for the FIR.ASM code. This program
initializes buffers and pointers. It also calls the FIR.ASM code.
***************************************************************************/
96
FIR & IIR Filters 4
.ENDSEG;
fir_init:
lcntr=r0, do zero until lce;
zero: dm(i0,m0)=0; /*initalize the delay line to 0*/
rts;
.ENDSEG;
97
4 FIR & IIR Filters
Listing 4.1 firtest.asm
File Name
FIR.ASM
Version
0.03 11/2/93
Purpose
This is a subroutine that implements FIR filter code given
coefficents and samples.
Equation Implemented
y(n)=Summation from k= 0 to M of H sub k times x(n-k)
Calling Parameters
f0 = input sample x(n)
r1 = number of taps in the filter minus 1
b0 = address of the delay line buffer
m0 = modify value for the delay line buffer
l0 = length of the delay line buffer
b8 = address of the coefficent buffer
m8 = modify value for the coefficent buffer
l8 = length of the coefficent buffer
Return Values
f0 = output sample y(n)
Registers Affected
f0, f4, f8, f12
i0, i8
Cycle Count
6 + (Number of taps -1) + 2 cache misses
# PM Locations
7 instruction words
Number of taps locations for coefficents
# DM Locations
Number of taps of samples in the delay line
***************************************************************************/
98
FIR & IIR Filters 4
.GLOBAL fir;
.EXTERN coefs, dline;
.SEGMENT /PM pm_code;
.ENDSEG;
99
4 FIR & IIR Filters
Listing 4.2 fir.asm
The direct form is not commonly used in IIR filter design. Another form,
called the canonical form, (also called direct form II) uses half as many delay
stages, and therefore uses less memory. By assuming a linear time
invariant system, the direct form can be mathematically manipulated to
get the canonical form. The equation for the canonical form is
The input sample is x(n) and y(n) is the output sample. The term w(n) is
called the intermediate value; w(n – 1) is the previous value and w(n – 2)
the one before that. The variables a and b are the coefficients for the filter.
The example in this section uses a variant of the canonical form equation
that limits the summation in the canonical form to two. This type of IIR
filter is called a biquad and is a second order filter. Higher order filters can
be constructed by cascading several biquad filters together.
• IIR filters are made up of poles and zeros in the complex plane. The
poles give IIR filters an ability to realize transfer functions that FIR
filters cannot.
100
FIR & IIR Filters 4
• IIR filters are not necessarily stable—it is the designer’s task to ensure
stability.
4.2.1 Implementation
This section of the chapter describes the assembly code for the biquad IIR
filter. The IIR filter code executes a direct form II realization structure.
• The architecture file is GENERIC.ACH , the same file used in the FIR
filter code.
• The file INPUT.DAT contains data used to initialize the input buffer
for the IIR filter code. This file consists of 300 samples that represent an
impulse.
101
4 FIR & IIR Filters
The first buffer declared, INBUF , is 300 locations long. This buffer is
initialized with the data file INPUT.DAT , which contains input samples
for the filter. These input samples represent an impulse response so that
the impulse response of the filter can be examined.
The second buffer declared, OUTBUF , is also 300 locations long, and stores
102
FIR & IIR Filters 4
DLINE is the third buffer declared. This buffer holds the delay line that is
the w(n-1) and w(n-2) for each biquad stage of the filter. Since there are
two past intermediate values for each stage of a biquad, the length of this
buffer is twice as long as the number of stages. The delay line buffer is
ordered
1st stage w(n – 2), 1st stage w(n – 1), 2nd stage w(n – 2), 2nd stage
w(n – 1)…
The coefficients buffer, which is the next buffer initialized in the code,
must be ordered so that it matches the order of the delay line buffer. The
order is
1st stage a2, 1st stage a1, 1st stage b2, 1st stage b1, 2nd stage a2…
After the buffers are declared, the code in IIRMEM.ASM assigns pointers
to each data buffer. When assigning pointers to the buffers, a base register
(B) and a length (L) register in the data address generators must be
defined. When the base register is loaded with a value, the corresponding
index register (I) is automatically loaded with a value. Following is a list of
index register pointers and their corresponding buffers:
103
4 FIR & IIR Filters
The length (L register) of DLINE is six and COEFFS is twelve; the other
buffer lengths are set to zero to indicate that they are not circular. After a
buffer is traversed, the index register is specifically written with the base
address by an instruction.
This line sets F12 to zero (by performing the subtraction) and fetches a
delay line value and a coefficient value. These two values represent w(n –
2) and a2 for the first biquad stage. This instruction is executed outside of
the QUADS loop so that data is available to be multiplied in the first
instruction of the loop. Each iteration of the QUADS loop represents one
biquad stage of the filter. If there are three biquad stages of the filter, this
loop is repeated three times. The register R0 (which was assigned a value
equal to the number of stages of the filter in the IIRMEM.ASM module) is
104
FIR & IIR Filters 4
used to set the number of times that the loop will be repeated.
The computational registers used by the loop are F2, F3, F4, F8, and F12.
Following is a list of what these registers hold for each stage:
• The next biquad stage must be calculated and the loop executed again.
The addition in the first line of the loop calculates the final result of the
output of the previous stage.
• The biquad stages of filter executes and the addition (on the line with
105
4 FIR & IIR Filters
File Name
IIRMEM.ASM
Version
created 4/25/91
modified 12/1/93
Purpose
This program declares and initializes input, delay line, and output
buffers that are used in IIR filter calculations. The program then
calls the module CASCADE.ASM which calculates the IIR filter outputs.
The coefficients for the IIR filter are read from the file IIRCOEFS.DAT
which represent a sixth order bandpass elliptical filter.
Equation Implemented
Calling Parameters
Return Values
Registers Affected
Cycle Count
# PM Locations
# DM Locations
106
FIR & IIR Filters 4
***************************************************************************/
l3=0;
b4=outbuf; /*I4 points to output buffer*/
l4=0;
l1=0; /* Updates new values in delay
line,*/
/* b1 is defined in CASCADE.ASM */
m1=1;
l8=12; /*Length for coefficents, B8 is
defined
in filtering loop*/
m8=1;
b0=dline; /*I0 points to delay line buffer*/
call cascaded_biquad_init (db); /*Zero the delay line */
l0=6;
r0=SECTIONS; /* r0=number of times through
.ENDSEG;
File Name
CASCADE.ASM
Version
created 4/25/91
modified 12/1/93
Purpose
This module implements a biquad, canonical form, IIR filter. It needs
to passed an input sample, a coefficent buffer, and a delay line
108
FIR & IIR Filters 4
buffer. The coefficent buffer should be in the order a2, a1, b2, b1
for each biquad section section. The delay line buffer should be ordered
w(n-2), w(n-1) for each biquad section. See the next section, equations
implemented, to see what the aformentioned variables mean.
Equation Implemented
w(n) = x(n) + a1*w(n-1) + a2*w(n-2)
y(n) = w(n) + b1*w(n-1) + b2*w(n-2)
w(n)=x(n)-a1*w(n-1)-a2*w(n-2)
Since the code implements the w(n) equation with the additions
the coefficents for the filter have to take into account the
minus sign.
Calling Parameters
f8 = input sample x(n)
r0 = number of biquad sections
b0 = address of DELAY LINE BUFFER
b8 = address of COEFFICENT BUFFER
m1 = 1, modify value for delay line buffer
m8 = 1, modify value for coefficent buffer
l0 = 0
l1 = 0
l8 = 0
Return Values
f8 = output sample y(n)
Registers Affected
f2, f3, f4, f8, f12
i0, b1, i1, i8
Cycle Count
6 + 4*(number of biquad sections) + 5 cache misses
109
4 FIR & IIR Filters
# PM Locations
10 instruction words
4 * (number of biquad sections) locations for coefficents
# DM Locations
2 * (number of biquad sections) locations for the delay line
***************************************************************************/
.GLOBAL cascaded_biquad;
.SEGMENT/PM pm_code;
cascaded_biquad:
/*Call this for every sample to be filtered*/
b1=b0;
/*I1 used to update delay line with new values*/
f12=f12-f12, f2=dm(i0,m1), f4=pm(i8,m8);
/*set f12=0, get a2 coefficent, get w(n-2)*/
lcntr=r0, do quads until lce;
/*execute quads loop once for each biquad section */
f12=f2*f4, f8=f8+f12, f3=dm(i0,m1), f4=pm(i8,m8);
/* a2*w(n-2), x(n)+0 or y(n) for a section, get w(n-1), get a1*/
110
FIR & IIR Filters 4
section*/
rts (db),f8=f8+f12;
/*return, compute y(n)*/
nop;
nop;
.ENDSEG;
4.3 SUMMARY
FIR and IIR filters each have their own characteristics and advantages that
makes one or the other suitable for a particular application. If linear phase
is a critical factor, then an FIR filter is best. If a design has tight memory
and instruction constraints, then choose an IIR filter.
4.4 REFERENCES
111
Multirate Filters 5
Decimation and interpolation are the two basic types of multirate filtering
processes.
113
5 Multirate Filters
5.1.1 Implementation
The decimate.asm code starts by initializing the data buffers
(input_buf , data , and coef ) and the respective DAG registers. The
input buffer, input_buf , is M long where M is the decimation factor.
The data buffer, data , and filter coefficient buffer, coef , are both N
locations long, where N is the number of taps for the filter. The I7 and L7
registers are specifically used for the input_buf so that the circular
buffer overflow interrupt can be used. The timer period is also
programmed to the desired input frequency. Since the timer interrupt
happens during the circular buffer overflow service routine, interrupt
nesting is enabled.
114
Multirate Filters 5
The timer interrupt service routine ( tmzh_svc ) gets the input sample
from an A/D converter port ( adc ) and stores the samples in the input
buffer input_buf . When M samples have been acquired, the input
buffer is full and a circular buffer overflow interrupt is generated.
The circular buffer overflow interrupt service routine ( cb7_svc ) calls the
decimator routine ( decimate ) only every M times, not at every sample.
This removes M–1 samples in between decimator calls.
The decimate routine transfers the input buffer to the data buffer as input
for the decimation filter, converting from the sampled 16-bit fixed point
number to an equivalent 32-bit floating point number. This double
buffering allows the filter computations to proceed in parallel with the
acquisition of the next M inputs, allowing a larger order filter than if all
the filter calculations are made between input sample periods.
To save cycles, the data transfer is performed in a rolled loop to allow the
delayed branch (db) call. The fix-to-float and scaling operations are
performed in parallel, also to save cycles. If the decimation factor is less
than four, the data conversion and transfer is more efficient if performed
unrolled.
Since the acquisition of the input data and calculation of the FIR filter are
in parallel, the decimator must calculate one pass of the FIR filter while M
samples are being read into the input buffer, input_buf . The following
equation determines the value of the timer period:
115
5 Multirate Filters
buffer at the start of the cb7_svc routine before the first timer interrupt
occurs again. Because the entire decimate routine (19 cycles) must be
executed between each sampling, there won’t be enough time for this if
the timer period is not at least 20.
116
Multirate Filters 5
5.1.2 Code Listings–decimate.asm
/************************************************************************
File Name
DECIMATE.ASM
Version
3/4/91
Purpose
Real Time Decimator
Calling Parameters
Input: adc
r15 = ADC input data
Return Values
Output: dac
r0 = DAC output data
Registers Affected
r0 = FIR data in / ADC fltg-pt data
r4 = FIR coefficients / temporary reg
r8 = FIR accumulate result reg
r12 = FIR multiply result reg / temporary reg
r14 = 16 = exponent scaling value
Start Labels:
init_dec reset-time initialization
decimate called by CB7 interrupt
Computation Time:
cb7_svc = 6 + decimate = N + M*2 + 11
decimate = N + M*2 + 5 + [5 misses]
tmzh_svc = 5
# PM Locations
47 Words Code, N Words Data
# DM Locations
N + M Words Data
************************************************************************/
117
5 Multirate Filters
Creating & Using the Test Case:
coef.dat contains a 32-tap FIR filter which bandlimits the input signal to 1/8
of the input frequency. Since the decimator in the test case decimates by a
factor M=4, this bandlimitation is equivalent to the required limit of 1/2 the
output frequency. The filter is a Parks-McClellan filter with a passband
ripple of 1.5dB out to about 1/10 the input frequency, and a stopband with
greater than 70dB of attenuation above 1/8 the input frequency.
The example data file is wave.1. It contains 512 signed fixed-point data
which emulate a 16-bit A/D converter whose bits are read from Data Memory bits
39:24. wave.1 contains a composite waveform generated from 3 sine waves of
differing frequency and magnitude. The cb7_svc routine arithmetically shifts
this data to the right by 16 bits, effectively zero-ing out the garbage data
which would be found in bits 23:0 if a real 16-bit ADC port were read.
The test case writes the decimated output to a dac port. Since there are 512
samples in wave.1, and the test case performs decimation by 4, there will be
128 data values written to the dac port if all data samples are read in and
processed. The values written out are left-shifted by 16 bits in parallel
with the float—>fixed conversion, based again on the assumption that the D/A
converter port is located in the upper 16 bits of data memory.
2. In the simulator,
a. Go to the program memory (disassembled) window,
d. Go to symbol cb7_done, set to break on 129th occurrence, and
c. Run
118
Multirate Filters 5
3. Compare wave.1 to out.1 on the graphing program of your choice.
———————————————————————————————————————
#include “def21020.h”
#define N 32 /* number of taps */
#define M 4 /* decimation factor */
#define FP 20.0e6 /* Processor Frequency = 20MHz */
#define FI 64.0e3 /* Interrupt Frequency = 64KHz */
/*———————————————————————————————————————
RESET Service Routine
———————————————————————————————————————*/
.SEGMENT /pm rst_svc;
rst_svc: PMWAIT=0x0021; /* no wait states,internal ack only */
DMWAIT=0x8421; /* no wait states,internal ack only */
jump init_dec; /* initialize the test case */
.ENDSEG;
/*———————————————————————————————————————
TMZH Service Routine
———————————————————————————————————————*/
.SEGMENT /pm tmzh_svc;
/* sample input: this code occurs at the sample rate */
tmzh_svc: rti(db);
r15=dm(adc); /* get input sample */
dm(i7,m0)=r15; /* load in M long buffer */
.ENDSEG;
119
5 Multirate Filters
/*———————————————————————————————————————
DAG1 Circular Buffer 7 Overflow Service Routine
———————————————————————————————————————*/
.SEGMENT /pm cb7_svc;
/* process input data when circular buffer 7 wraps around (“overflows”) */
/* this code occurs at 1/M times the sample rate */
cb7_svc: call decimate (db); /* call the decimator */
r12=dm(i7,m0);
r4=ashift r12 by -16;
rti (db); /* return from interrupt */
r0 = fix f0 by r14; /* convert to fixed point */
cb7_done: dm(dac)=r0; /* output data sample */
.ENDSEG;
/*———————————————————————————————————————
efficient decimator initialization
———————————————————————————————————————*/
.SEGMENT /pm pm_code;
init_dec: b0=data; l0=@data; m0=1; /* data buffer */
b7=input_buf; l7=@input_buf; /* input buffer */
b8=coef; l8=@coef; m8=1; /* coefficient buffer */
wait: idle;
jump wait; /* infinite wait loop */
.ENDSEG;
120
Multirate Filters 5
/*———————————————————————————————————————
efficient decimator routine
executes at 1/M times the sample rate
———————————————————————————————————————*/
.SEGMENT /pm pm_code;
decimate: /* Transfer input samples to the FIR data memory space */
/* using the input buffer working pointer. Perform */
/* fix—>float conversion and right-shifting in parallel. */
/* Note: this loop MUST be unrolled if M<3. It would be */
/* more efficient if unrolled for any M<4. The 1st two */
/* instructions of this loop have been executed in the delay */
/* field of the call to this subroutine */
f0=float r4, r12=dm(i7,m0);
lcntr=M-2, do load_data until lce;
r4=ashift r12 by -16, dm(i0,m0)=f0;
load_data: f0=float r4, r12=dm(i7,m0);
r4=ashift r12 by -16, dm(i0,m0)=f0;
f0=float r4;
dm(i0,m0)=f0;
.ENDSEG;
121
5 Multirate Filters
5.2 SINGLE-STAGE INTERPOLATION FILTER
A single-stage interpolation filter reconstructs a discrete-time signal from
another discrete-time signal. This is analogous to the decimator where a
sampled signal is sampled again at a different rate. By inserting more data
points between the originally sampled data points, interpolation increases
the output sample rate. Interpolation performs the function of a digital
anti-imaging filter on the original signal. Many digital audio systems, such
as compact disc and digital audio tape players, use interpolation
(oversampling) to avoid using high-performance analog reconstruction
filters. The anti-imaging functions in interpolators allow the use of low-
order analog filters on the D/A outputs.
5.2.1 Implementation
The interpol.asm code (Listing 5.2) uses an N-tap Direct Form Finite
Impulse Response Filter in an efficient algorithm to interpolate by L=4.
This increases the input sample rate by a factor of four. The number of
taps is restricted such that N /L must be an integer. An undesirable by-
product of the FIR filter is the attenuation of the input signal by a factor of
L. To correct this, the input data is scaled up by L before it is filtered.
Because most of the delay line inputs are zeros, calculation time would be
wasted if the zero inputs were multiplied by their corresponding
coefficients. Instead, the coefficients are interleaved to perform only the
non-trivial calculations. This technique is known as Poly-phase filtering.
Thus, the delay line must be only N/L locations long. In this example N=32
and L=4, so the delay line, data , is eight location long. The coefficient
buffer, coef , is 32 locations long. The example initializes the delay line
address modifier to one and the coefficient buffer address modifier to L =
4, for the proper interleave.
The internal timer generates interrupts at l times the output sample rate,
which is the desired output frequency. The register R2 is used as a counter
in the timer service routine ( tmzh_svc ) to get a new sample from the A/
D converter ( sample ) every Lth pass of the interrupt. When a new
122
Multirate Filters 5
sample is read in, it is converted from a 16-bit fixed point value to a 32-bit
floating point value and scaled up by L to compensate for the attenuation
effect. The routine also resets the counter to L and modifies the coefficient
pointer by L for proper interleave.
The filter output is converted back to a 16-bit fixed point value and
written to a D/A converter ( dac ) after each interpolation. The technique
of rolled loops and delayed branches are used to conserve cycles.
123
5 Multirate Filters
5.2.2 Code Listing–interpol.asm
/************************************************************************
File Name
INTERPOL.ASM
Version
3/6/91
Purpose
Real time Interpolator
Calling Parameters
Input: adc
r15 = ADC input data
Return Values
Output: dac
r0 = DAC output data
Registers Affected
r0 = FIR data in / ADC data / temp register
r1 = scale factor L
r2 = counter
r4 = FIR coefficients
r8 = FIR accumulate result reg
r12 = FIR multiply result reg / temporary reg
r14 = 16 = exponent scale factor
Start Labels
init_int reset-time initialization
interpol called by Timer Zero High Priority interrupt
Computation Time
tmzh_svc = 5 + interpol
= N/L + 9 (no sample)
= N/L + 17 (with sample)
# PM Locations
44 Words Code, N Words Data
# DM Locations
N/L Words Data
************************************************************************/
124
Multirate Filters 5
——————————————————————————————————————
Creating & Using the Test Case:
coef.dat contains a 32-tap FIR filter which bandlimits the output signal to
1/8 of the output frequency. Since the interpolator in the test case
interpolates by a factor M=4, this bandlimitation is equivalent to the
required limit of 1/2 the input frequency. The filter is a Parks-McClellan
filter with a passband ripple of 1.5dB out to about 1/10 its input frequency,
and a stopband with greater than 70dB of attenuation above 1/8 the input
frequency.
The example data file is wave.1. It contains 512 signed fixed-point data
which emulate a 16-bit A/D converter whose bits are read from Data Memory bits
39:24. wave.1 contains a composite waveform generated from 3 sine waves of
differing frequency and magnitude. The cb7_svc routine arithmetically shifts
this data to the right by 16 bits, effectively zero-ing out the garbage data
which would be found in bits 23:0 if a real 16-bit ADC port were read.
The test case writes the interpolated output to a dac port. Since there are
512 samples in wave.1, and the test case performs interpolation by 4, there
will be 2048 data values written to the dac port if all data samples are read
in and processed. The values written out are left-shifted by 16 bits in
parallel with the float—>fixed conversion, based again on the assumption that
the D/A converter port is located in the upper 16 bits of data memory.
2. In the simulator,
a. Go to the program memory (disassembled) window,
d. Go to symbol sample, set to break on 513th occurence, and
c. Run
——————————————————————————————————————*/
125
5 Multirate Filters
#include “def21020.h”
#define N 32 /* number of taps */
#define L 4 /* interpolate by factor of L */
#define FP 20.0e6 /* Processor Frequency = 20MHz */
#define FI 64.0e3 /* Interrupt Frequency = 64KHz */
/*———————————————————————————————————————
RESET Service Routine
———————————————————————————————————————*/
.SEGMENT /pm rst_svc;
rst_svc: PMWAIT=0x0021; /* no wait states,internal ack only */
DMWAIT=0x8421; /* no wait states,internal ack only */
jump init_int; /* initialize the test case */
.ENDSEG;
/*———————————————————————————————————————
TMZH Service Routine
———————————————————————————————————————*/
.SEGMENT /pm tmzh_svc;
/* sample input: this code occurs at the L*input rate */
tmzh_svc: r2=r2-1,modify(i8,m9); /* decrement counter, and */
/* shift coef pointer back */
if eq call sample; /* test and input if L times */
.ENDSEG;
126
Multirate Filters 5
/*———————————————————————————————————————
Initialize Interpolation Routine
———————————————————————————————————————*/
.SEGMENT /pm pm_code;
init_int: /* initialize buffer index registers */
b0=data; l0=@data; m0=1;
b8=coef; l8=@coef; m8=L; /* modifier for coef is L */
m9=-1;
/* Register Initialization */
r1=L; /* value for upscaling sample */
f1=float r1; /* fix —> float conversion */
r2=1; /* set counter to 1 for 1st sample */
r14=16; /* exponent scale factor */
wait: idle;
jump wait; /* infinite wait loop */
.ENDSEG;
/*———————————————————————————————————————
sample & interpolate
code executes at the (L*input_rate)
———————————————————————————————————————*/
.SEGMENT /pm pm_code;
interpol: /* FIR Filter, occurs at L times the input input rate. */
/* First two instructions of this filter have been run */
/* in the delay field of the jump to this routine */
f12=f0*f4, f0=dm(i0,m0), f4=pm(i8,m8);
lcntr=N/L-3, do taps until lce;
taps: f12=f0*f4, f8=f8+f12, f0=dm(i0,m0), f4=pm(i8,m8);
f12=f0*f4, f8=f8+f12;
f0=f8+f12;
rti (db);
r0 = fix f0 by r14; /* float -> fixed */
dm(dac)=r0; /* output data sample */
.ENDSEG;
127
5 Multirate Filters
/*———————————————————————————————————————
get sample
code executes at the input_rate
———————————————————————————————————————*/
.SEGMENT /pm pm_code;
/* input data sample once every L times (i.e., at the input rate) */
sample: r0=dm(adc); /* input data sample */
/* input sample occuppies bits 39:24; shift to 15:0 */
r0 = ashift r0 by -16; /* shift w/ sign extend */
/* do fix->float, and shift coef pointer up by L */
f0=float r0, modify(i8,m8);
rts(db), f0=f0*f1; /* upscale sample by L */
dm(i0,m0)=f0; /* update delay line */
r2=m8; /* reset counter to L */
.ENDSEG;
128
Multirate Filters 5
5.3 RATIONAL RATE CHANGER (TIMER-BASED)
This section describes a rational rate changer implemented by cascading
an interpolating filter and a decimating filter. Interpolating filters and
decimating filters can only change the rate of a signal by a factor that is an
integer. This restriction makes these filters useless for many real world
applications when used by themselves. By cascading these filters, the
frequency may be changed by a factor that is a rational number.
For example, to transfer data between a compact disc with a sampling rate
of 44.1 kHz to a digital audio tape player with a sampling rate of 48.0 kHz,
the sampling rate of the compact disc must be increased by a factor of 48/
44.1, which is obviously not an integer. By using a combination of an
interpolation filter and decimation filter, a rational number can be
determined that approximates the desired frequency change.
The interpolation must take place first so as not to lose some of the desired
output frequency bandwidth that otherwise is filtered out by the
decimator. The anti-imaging filter (which operates after interpolation) and
anti-aliasing filter (which operates before decimation) can be combined
into a single low-pass filter.
5.3.1 Implementation
This example, ratiobuf.asm (Listing 5.3), uses an N-tap (in this case,
N = 32) real time direct FIR filter that performs interpolation by L = 4 and
decimation by M = 3 for a L/M = 4/3 change in the input sample rate. The
same techniques used in both interpolation and decimation filters are
utilized to dedicate an entire output period for calculating one output
sample. The ratio N/L is restricted to be an integer.
Like the interpolation algorithm, the ratio changer samples the incoming
signal at L times the sampling rate using the timer interrupt, and only
updates the input buffer every Lth pass. Also, every zero-valued
multiplication is skipped by interleaving the coefficients. Like the
decimator algorithm, the input is double-buffered so that the filter
computations can occur in parallel with the input and output of data. This
allows the use of a larger order filter for a given input sample rate.
This ratio changer implementation is counter based. The counter for the
inputs, IN_CNTR , is set to L, and the counter for the outputs, OUT_CNTR
, is set to M. If the IN_CNTR expires, a new input sample is read into the
input buffer. If the OUT_CNTR expires, a flag is set to indicate that the
main routine must calculate a new output sample.
129
5 Multirate Filters
The program then waits in an idle loop for the timer interrupt. Every timer
interrupt, the counters are decremented and checked to see if either or
both have expired:
• If the OUT_CNTR has not expired, the program returns to the idle
statement to wait for the next timer interrupt.
• If the value is zero, a new A/D value was not input since the last time
an output was calculated. The data transfer from input_buf to
data_buf is skipped.
130
Multirate Filters 5
The calc_out routine then adjusts the pointer to the coefficients for the
proper interleave, and executes the low-pass filter code to generate an
output. The low-pass filter is an N/L tap filter implemented as a rolled
loop, similar to both the single-stage interpolation and decimation filters.
For example, the index register I9 is used to track coefficient updates, and
must be modified in the output subroutine. These instructions are
executed:
modify(i9,-M);
m10=i9;
modify(i8,m10);
i9=0;
A dead cycle occurs after the second instruction and possibly after the
fourth instruction (if the fifth instruction uses DAG2). To avoid dead
cycles, the first two instructions are moved away from the third, and the
fourth instruction is moved to a place where a DAG2 operation does not
follow it.
131
5 Multirate Filters
There are two reasons why the user software interrupt 0 (USI0) is used to
call the calc_out routine. The first reason is that the calc_out
routine must be allowed to be interrupted by the timer interrupt service
routine. The timer interrupt occurs more frequently than the calc_out
routine. If the timer interrupt routine called the calc_out routine, the
timer interrupt would occur again before the calc_out routine had
completed. An interrupt routine cannot interrupt itself. Therefore, the
calc_out routine is called from a lower priority interrupt (USI0).
This code structure also allows an easier transition from the counter-based
rational rate changer algorithm to the external interrupt based algorithm
described in the next section.
132
Multirate Filters 5
5.3.2 Code Listings–ratiobuf.asm
/**************************************************************
File Name
RATIOBUF.ASM
Version
3/4/91
Purpose
Timer-Driven Real Time Rational Rate Changer
Calling Parameters
Input: adc
r0 = ADC input data
Return Values
Output: dac
r0 = DAC output data
Registers Affected
r0 = FIR data in / ADC data / temp register
r1 = temp register
r2 = input counter
r3 = output counter
r4 = FIR coefficients
r5 = scale value L
r8 = FIR accumulate result reg
r12 = FIR multiply result reg / temporary reg
r14 = 16 = exponent scale factor
Start Labels
init_rat reset-time initialization
input called by Timer Zero High Priority interrupt
calc_out called by Software Interrupt 0
Computation Time
tmzh_svc = 15
sft0_svc = NoverL + 2+intMoverL + 16
# PM Locations
67 Words Code, N Words Data
# DM Locations
N/L + M/L Words Data
**************************************************************/
133
5 Multirate Filters
/*———————————————————————————————————————
Creating & Using the Test Case:
The test case writes the rate-changed output to a dac port. Since there are
512 samples in wave.1, and the test case performs interpolation by 4 followed
by decimation by 3, there will be 682 data values written to the dac port if
all data samples are read in and processed. The values written out are
left-shifted by 16 bits in parallel with the float—>fixed conversion, based
again on the assumption that the D/A converter port is located in the upper 16
bits of data memory.
2. In the simulator,
a. Go to the program memory (disassembled) window,
d. Go to symbol input, set to break on 513th occurence, and
c. Run
——————————————————————————————————————*/
#include “def21020.h”
#ifndef TEST
/*—————————————————————————————*/
134
Multirate Filters 5
#define TPER_VAL 312 /* TPER_VAL = FP/FI - 1 */
#else
/*—————————————————————————————*/
#define TPER_VAL 39 /* interrupt every 40 cycles */
#endif
/*—————————————————————————————*/
/*———————————————————————————————————————
RESET Service Routine
———————————————————————————————————————*/
.SEGMENT /pm rst_svc;
rst_svc: PMWAIT=0x0021; /* no wait states,internal ack only */
DMWAIT=0x8421; /* no wait states,internal ack only */
jump init_rat; /* initialize the test case */
.ENDSEG;
/*———————————————————————————————————————
Timer=zero high priority Service Routine
———————————————————————————————————————*/
.SEGMENT /pm tmzh_svc;
tmzh_svc: IN_CNTR=IN_CNTR-1; /* test if time for input */
if ne jump skipin(db);
OUT_CNTR=OUT_CNTR-1; /* test if time for output */
nop;
jump input (db); /* calculate the input sample */
r0=dm(adc); /* get input sample */
r0=ashift r0 by -16; /* right-justify the data */
.ENDSEG;
/*———————————————————————————————————————
Software Interrupt 0 Service Routine
———————————————————————————————————————*/
.SEGMENT /pm sft0_svc;
sft0_svc: jump calc_out (db); /* calculate the output sample */
r0=i7; /* get current ptr to input buffer */
r1=input_buf; /* get start addr of input buffer */
135
5 Multirate Filters
/*———————————————————————————————————————
Initialize Interpolation Routine
———————————————————————————————————————*/
.SEGMENT /pm pm_code;
init_rat: /* initialize buffer index registers */
b0=data; l0=@data; m0=1; /* data buffer */
b7=input_buf; l7=0; /* input buffer */
b8=coef; l8=@coef; m8=L; /* modifier for coef is L! */
b9=0;l9=0;m9=-M; /* track coefficient updates */
wait: idle;
jump wait; /* infinite wait loop */
.ENDSEG;
/*———————————————————————————————————————
Calculate output
initiated by software interrupt 0, code below occurs at the
output sample rate
———————————————————————————————————————*/
.SEGMENT /pm pm_code;
calc_out: r1=r0-r1, i7=b7; /* calculate amount in inbuffer */
136
Multirate Filters 5
/* filter pass, occurs at L times the input sample rate */
fir: f0=dm(i0,m0), f4=pm(i8,m8);
f8=f0*f4, f0=dm(i0,m0), f4=pm(i8,m8);
f12=f0*f4, f0=dm(i0,m0), f4=pm(i8,m8);
lcntr=NoverL-3, do taps until lce;
taps: f12=f0*f4, f8=f8+f12, f0=dm(i0,m0), f4=pm(i8,m8);
f12=f0*f4, f8=f8+f12;
f0=f8+f12,i7=b7; /* reset input buffer in || w/comp */
.ENDSEG;
/*———————————————————————————————————————
Acquire input
initiated by Timer Interrupt, code occurs at L times the input rate
———————————————————————————————————————*/
.SEGMENT /pm pm_code;
input: /* convert fixed->float, modify the coef update by L */
f0=float r0,modify(i9,m8);
f0=f0*f5; /* scale by L */
dm(i7,m0)=f0; /* load in M long buffer */
/* set AZ flag w/ OUT_CNTR, reset the input count to L */
OUT_CNTR=pass OUT_CNTR, IN_CNTR=m8;
output: rti(db);
/* cause output routine if OUT_CNTR==0 */
bit set irptl SFT0I;
OUT_CNTR=M; /* reset output counter to M */
.ENDSEG;
137
5 Multirate Filters
5.4 RATIONAL RATE CHANGER (EXTERNAL INTERRUPT-BASED)
The rational rate changer shown in rat2int.asm (Listing 5.4) performs
the same operation as the rational rate changer (counter based) algorithm.
The interrupt based algorithm, however, uses two external interrupts
instead of the counters to generate the proper interpolation and
decimation interrupts.
5.4.1 Implementation
The IRQ3 interrupt service routine performs the input function of the
rational rate changer. An external timer is necessary to cause this interrupt
to occur at the input sample rate. This interrupt replaces the timer
interrupt and IN_CNTR in the counter based changer. The IRQ2
interrupt service routine performs the output function of the rational rate
changer. Again, another external timer is necessary to cause the interrupt
to occur at the output sample rate. This interrupt routine replaces the User
Software Interrupt 0 routine and OUT_CNTR in the counter based
changer.
138
Multirate Filters 5
5.4.2 Code Listing–rat_2_int.asm
/************************************************************************
File Name
RAT_2INT.ASM
Version
3/4/91
Purpose
Interrupt-Driven Real Time Rational Rate Changer
Calling Parameters
Input: adc
r0 = ADC input data
Return Values
Output: dac
r0 = DAC output data
Registers Affected
r0 = FIR data in / ADC data / temp register
r1 = temp register
r4 = FIR coefficients
r5 = scale value L
r8 = FIR accumulate result reg
r12 = FIR multiply result reg / temporary reg
r14 = 16 = exponent scale factor
Start Labels
init_rat reset-time initialization
irq3_svc IRQ3 interrupt service routine for input
output called by IRQ2 interrupt service routine
Computation Time
irq2_svc = 6
irq3_svc = 3 + output
output >= NoverL + 14 ( Notes: 1,3 )
<= NoverL + 2*intMoverL + 13 ( Notes: 2,3 )
Notes:
1. w/ no input samples to transfer
2. w/max input samples to transfer
3. maximum of 5 cache misses on the 1st pass
# PM Locations
53 Words Code, N Words Data
# DM Locations
N/L + M/L Words Data
************************************************************************/
139
5 Multirate Filters
———————————————————————————————————————
Creating & Using the Test Case:
The test case writes the rate-changed output to a dac port. Since there are
512 samples in wave.1, and the test case performs interpolation by 4 followed
by decimation by 3, there will be 682 data values written to the dac port if
all data samples are read in and processed. The values written out are
left-shifted by 16 bits in parallel with the float—>fixed conversion, based
again on the assumption that the D/A converter port is located in the upper 16
bits of data memory.
2. In the simulator,
a. Run keystroke file setint.ksf, which programs periodic interrupts,
a. Go to the program memory (disassembled) window,
d. Go to symbol irq3_svc, set to break on 513th occurence, and
c. Run
——————————————————————————————————————*/
#include “def21020.h”
140
Multirate Filters 5
.PORT dac; /* Digital to Analog (output) converter */
.ENDSEG;
/*———————————————————————————————————————
RESET Service Routine
———————————————————————————————————————*/
.SEGMENT /pm rst_svc;
rst_svc: PMWAIT=0x0021; /* no wait states,internal ack only */
DMWAIT=0x8421; /* no wait states,internal ack only */
jump init_rat; /* initialize the test case */
.ENDSEG;
/*———————————————————————————————————————
IRQ3 Interrupt Service Routine
occurs at input rate
———————————————————————————————————————*/
.SEGMENT /pm irq3_svc;
irq3_svc: r0=dm(adc); /* get input sample */
r0=ashift r0 by -16; /* right-justify the data */
/* convert fixed->float, modify the coef update by L */
f0=float r0,modify(i9,m8);
rti(db);
f0=f0*f5; /* scale by L */
dm(i7,m0)=f0; /* load in input buffer */
.ENDSEG;
/*———————————————————————————————————————
IRQ2 Interrupt Service Routine
occurs at output rate
———————————————————————————————————————*/
.SEGMENT /pm irq2_svc;
irq2_svc: jump output (db); /* go to the output routine */
r0=i7; /* get input buffer pointer */
r1=input_buf; /* get start addr of input buffer */
.ENDSEG;
/*———————————————————————————————————————
Initialize Interpolation Routine
———————————————————————————————————————*/
.SEGMENT /pm pm_code;
init_rat: b0=data; l0=@data; m0=1; /* data buffer */
b7=input_buf; l7=0; /* input buffer */
b8=coef; l8=@coef; m8=L; /* modifier for coef is L! */
b9=0;l9=0;m9=-M; /* track coefficient updates */
141
5 Multirate Filters
/* enable timer zero high priority, software 0 interrupts */
bit set imask IRQ3I|IRQ2I;
bit clr irptl IRQ3I|IRQ2I; /* clear any pending irpts */
bit set mode1 IRPTEN|ALUSAT; /* enable interrupts, ALU sat*/
.ENDSEG;
/*———————————————————————————————————————
Calculate output
initiated by IRQ2, occurs at output sample rate
———————————————————————————————————————*/
.SEGMENT /pm pm_code;
output: r1=r0-r1,i7=b7; /* r1 = # samples in input buffer */
/* also reset input buffer */
.ENDSEG;
142
Multirate Filters 5
5.5 TWO-STAGE DECIMATION FILTER
In the previous example of a single-stage decimation filter, a single low-
pass filter is used to prevent the aliasing that otherwise occurs with the
rate compression. The filter presented here uses two stages to reduce the
number of calculations required with no loss of precision.
If the output frequency from the decimation filter is much smaller than the
input oversampled frequency, the transition band of the filter needs to be
very small. A small transition band requires a large number of taps, which
increases the computation time.
By cascading filters, the number of taps per stage can be reduced, thereby
increasing the computational efficiency. For example, the first stage filter
may not reduce the input sampling rate by very much, but the filter has a
very large transition band requiring fewer taps. The second stage filter
may then require a smaller transition band, but the output sample rate is
smaller as well, still only needing a few taps. This two-stage
implementation can be easily extended to more stages. See [CROCH83] for
design curves to help determine optimal rat changed factors for
intermediate stages. A two- or three-stage design provides a substantial
reduction in filter computations; further reductions in computations come
from adding more stages. Because the filters are cascaded, each stage must
have a passband ripple equal to the final passband ripple divided by the
number of stages used. The stopband ripple does not have to be less than
the final stopband ripple because each successive stage attenuates the
stopband ripple of the previous stage.
5.5.1 Implementation
This example (dec2stg.asm , Listing 5.5) uses two real time direct-form
FIR filters of N_1 and N_2 taps to decimate by M_1 * M_2 for a
decrease of 1/(M_1 * M_2) times the input sample rate. The use of an
input buffer allows the filter computations to proceed in parallel with the
acquisition of the next M_1 * M_2 inputs, allowing a larger order filter
than if all calculations are made between samples.
The timer interrupt is set at a rate equal to the input sample rate to store
all the incoming values in the input buffer. A counter ( CNTR ) is set to M_1
* M_2 . This allows the input buffer to have enough samples for both
stages of filters. When the counter expires, User Service Interrupt 0 (USI0)
is set. The USI0 interrupt service routine calls the dec2stg routine that
performs the data transfer and two-stage filter.
143
5 Multirate Filters
The length of the input buffer is twice as long as the decimation rate
(that is, 2 * M_1 * M_2 ) . This double-length buffer acts as two
separate input buffers. While one buffer is being transferred to the data
buffer for filtering, the second buffer stores values read in from the A/D.
This double buffering is accomplished by utilizing two pointers (I7 and I6)
that are offset from each other by M_1 * M_2 . The buffer is circular, so
that the pointers “ping-pong” between being the input pointer and the
transfer pointer.
Since the input buffer is twice as long as needed, the input samples are
now stored in the lower half of the input buffer with pointer I7, while the
transfer from the input buffer to the data buffer is done with pointer I6.
Since the double-buffer is circular, after the first time through the stages,
the pointers “ping-pong,” reversing roles. For more details, see [ADI90].
Because M_1 is less than three for this test case, the movement of data
from the input buffer to the buffer data1 is put at the start of the
subroutine dec2stg as straight-line (i.e., not looped) code. If M_1 is
greater than three, the following code is used:
r12=dm(i6,m0);
r4=ashift r12 by -16;
f0=float r4, r12=dm(i6,m0);
lcntr=M_1-2, do load_data until lce;
r4=ashift r12 by -16, dm(i0,m0)=f0;
load_data: f0=float r4, r12=dm(i6,m0);
r4=ashift r12 by -16, dm(i0,m0)=f0;
f0=float r4;
dm(i0,m0)=f0;
144
Multirate Filters 5
5.5.2 Code Listing–dec2stg.asm
/************************************************************************
File Name
DEC2STG.ASM
Version
3/18/91
Purpose
Two Stage Decimator
Calling Parameters
Input: adc
r15 = ADC input data
Return Values
Output: dac
r0 = DAC output data
Registers Affected
r0 = FIR data in / ADC fltg-pt data
r3 = counter
r4 = FIR coefficients
r8 = FIR accumulate result reg
r12 = FIR multiply result reg / temporary reg
r14 = 16 = exponent scale factor
Start Labels:
init_dec reset-time initialization
dec2stg called by software 0 interrupt
Computation Time:
sft0_svc = N_2 + 10 + M_2*(N_1 + 2*M_1 + 7)
tmzh_svc = 7
# PM Locations
73 Words Code, N_1 + N_2 Words Data
# DM Locations
N_1 + N_2 + M_1*M_2*2 Words
************************************************************************/
145
5 Multirate Filters
Creating & Using the Test Case:
coef1 & coef2 are the two parks-McClellan FIR filters used in this two-stage
decimator. Assume that the input sample rate is 192kHz, and the decimated
output is 192/4 = 48kHz. The first FIR filter has 32 taps, and is designed to
have maximum of 1dB of attenuation from 0 - 48kHz, then >90dB attenuation at
>96kHz. The second FIR filter has 128 taps, and is designed to have maximum
of 1dB of attenuation from 0 - 20kHz, then >90dB attenuation at >24kHz.
The test case writes the decimated output to a dac port. Since there are 512
samples in wave.1, and the test case performs decimation by 4, there will be
128 data values written to the dac port if all data samples are read in and
processed. The values written out are left-shifted by 16 bits in parallel
with the float—>fixed conversion, based again on the assumption that the D/A
converter port is located in the upper 16 bits of data memory.
2. In the simulator,
a. Go to the program memory (disassembled) window,
d. Go to symbol output, set to break on 129th occurence, and
c. Run
——————————————————————————————————————*/
#include “def21020.h”
#define N_1 32 /* number of taps in FIR #1 */
#define N_2 128 /* number of taps in FIR #2 */
#define M_1 2 /* 1st decimation factor */
#define M_2 2 /* 2nd decimation factor */
#define CNTR r3 /* counter */
#define FP 20.0e6 /* Processor Frequency = 20MHz */
#define FI 192.0e3 /* Input Frequency = 192KHz */
146
Multirate Filters 5
.SEGMENT /dm dm_data;
.VAR data1[N_1]; /* FIR input data #1 */
.VAR data2[N_2]; /* FIR input data #2 */
.VAR input_buf[M_1*M_2*2]; /* raw ADC data
*/
.ENDSEG;
/*———————————————————————————————————————
RESET Service Routine
———————————————————————————————————————*/
.SEGMENT /pm rst_svc;
rst_svc: PMWAIT=0x0021; /* no wait states,internal ack only
*/
DMWAIT=0x8421; /* no wait states,internal ack only
*/
call init_dec; /* initialize the test case */
wait: idle; /* infinite wait loop
*/
jump wait;
.ENDSEG;
/*———————————————————————————————————————
TMZH Service Routine
———————————————————————————————————————*/
.SEGMENT /pm tmzh_svc;
/* sample input: this code occurs at the sample rate
*/
tmzh_svc: /* decrement counter and get a sample
*/
CNTR=CNTR-1,r15=dm(i2,m2);
if eq jump zero_yes;
zero_no: rti(db);
nop;
dm(i7,m0)=r15; /* load in input buffer
*/
zero_yes: rti(db);
irptl=SFT0I; /* if counter==0, set interrupt
*/
dm(i7,m0)=r15; /* load in input buffer
(listing continues on next page)
147
5 Multirate Filters
*/
.ENDSEG;
/*———————————————————————————————————————
Interrupt Request 3 Service Routine
Dummy procedure so that the TMZH Service Routine does not fetch
instructions from empty memory
———————————————————————————————————————*/
.SEGMENT /pm irq3_svc;
/* dummy procedure */
irq3_svc: rti;
.ENDSEG;
/*———————————————————————————————————————
Software Interrupt 0 Service Routine
———————————————————————————————————————*/
.SEGMENT /pm sft0_svc;
/* process input data: this code occurs at 1/M times the sample rate
*/
sft0_svc: jump dec2stg (db); /* call the 2-stage decimator
*/
CNTR=M_1*M_2; /* reset input counter
*/
nop;
.ENDSEG;
/*———————————————————————————————————————
efficient decimator initialization
———————————————————————————————————————*/
.SEGMENT /pm pm_code;
init_dec: b0=data1; l0=@data1; m0=1; /* data buffer #1
*/
b1=data2; l1=@data2; /* data buffer #2
*/
b8=coef1; l8=@coef1; m8=1; /* coefficient buffer #1
*/
b9=coef2; l9=@coef2; /* coefficient buffer #2
*/
b7=input_buf; l7=@input_buf; /* input buffer sample ptr
*/
b6=input_buf; l6=@input_buf; /* inp. buffer working ptr
*/
b2=adc; l2=1; m2=0; /* set A/D conv. pointer
*/
148
Multirate Filters 5
CNTR=M_1*M_2; /* set input counter
*/
rts (db);
/* enable Timer high priority, software 0 interrupts
*/
bit set imask TMZHI|SFT0I;
/* enable interrupts, nesting, ALU saturation
*/
bit set mode1 IRPTEN|NESTM|ALUSAT;
.ENDSEG;
/*———————————————————————————————————————
efficient decimator routine
executes at 1/(M_1*M_2) times the sample rate
———————————————————————————————————————*/
.SEGMENT /pm pm_code;
dec2stg: lcntr=M_2, do stage_1 until lce;
/* transfer input samples to the FIR data memory space */
/* using the input buffer working pointer. Perform */
/* fix—>float conversion in parallel */
r12=dm(i6,m0);
r4=ashift r12 by -16;
f0=float r4, r12=dm(i6,m0);
r4=ashift r12 by -16, dm(i0,m0)=f0;
f0=float r4;
dm(i0,m0)=f0;
149
5 Multirate Filters
fir_1: f0=dm(i0,m0), f4=pm(i8,m8);
f8=f0*f4, f0=dm(i0,m0), f4=pm(i8,m8);
f12=f0*f4, f0=dm(i0,m0), f4=pm(i8,m8);
lcntr=N_1-3, do taps_1 until lce;
taps_1: f12=f0*f4, f8=f8+f12, f0=dm(i0,m0), f4=pm(i8,m8);
f12=f0*f4, f8=f8+f12;
f0=f8+f12;
rti (db);
r0 = fix f0 by r14; /* float —> fix
*/
output: dm(dac)=r0; /* output data sample
*/
.ENDSEG;
5.6.1 Implementation
Two delay lines are used, one for each filter stage. The delay line for the
first stage filter, int2stg , is loaded from the ADC. The delay line for the
150
Multirate Filters 5
second stage filter, do_fir2 , is loaded from the output of the first stage filter.
After the initialization of the buffers, the main program sits in an idle loop waiting
for timer interrupts. The timer interrupts occur at an interval of L_1 * L_2 times
the sample rate. In the timer service routine, the counters for each stage are
decremented. The counter with the smaller interpolation factor is decremented first (
CNT2=2 ) . If the counter has not expired, the service routine jumps to the do_fir2
routine. Therefore, the second stage of the filter is executed at an interval of L_1 *
L_2 times the input sample rate. If the counter has expired, the routine decrements
the second counter.
The second counter is set to the higher interpolation factor ( CNT=4 ) . If the counter
has not expired, the service routine jumps to the int2stg routine. This routine is
called at an interval of L_1 times the input sample rate. When the int2stg
routine is called, both the first stage and second stage filtering is performed.
If both counters have expired, the next input value is read from the ADC and then
both filter stages are performed. The filter computations and coefficient interleaving
are performed in the same manner as the single-stage interpolation filter.
Version
3/18/91
Purpose
Two Stage Interpolator
Calling Parameters
Input: adc
r15 = ADC input data
Return Values
151
5 Multirate Filters
Output: dac
r0 = DAC output data
Registers Affected
r1 = scale factor L
r2 = counter #1
r3 = counter #2
r4 = FIR coefficients
r8 = FIR accumulate result reg
r12 = FIR multiply result reg / temporary reg
r13 = 1 = counter compare register
r14 = 16 = exponent scale factor
Start Labels
init_int reset-time initialization
int2stg called by Timer Zero High Priority interrupt
Computation Time
tmzh_svc = N_1/L_1 + N_2/L_2 + 25
# PM Locations
67 Words Code, N_1 + N_2 Words Data
# DM Locations
N_1/L_1 + N_2/L_2 Words Data
************************************************************************/
———————————————————————————————————————
Creating & Using the Test Case:
coef.dat contains a 32-tap FIR filter which bandlimits the input signal to 1/8
of the input frequency. Since the decimator in the test case decimates by a
factor M=4, this bandlimitation is equivalent to the required limit of 1/2 the
output frequency. The filter is a Parks-McLellan filter with a passband
ripple of 1.5dB out to about 1/20 the input frequency, and a stopband with
greater than 70dB of attenuation above 1/8 the input frequency.
The data files are all of the form sin.X, where X ranges from 1 to 5. Each data
file contains 512 signed fixed-point data values in the range +/- 32767. These
data points are meant to resemble data read from a 16-bit A/D converter which
152
Multirate Filters 5
generates signed data. sin.1 through sin.4 contain simple sine waves at
different frequencies. sin.5 contains a composite waveform generated from 3
sine wave of differing frequency and magnitude.
The test case writes the interpolated output to a dac port. Since there are 512
samples in sin.X, and the test case performs decimation by 4, there will be 128
data values written to the dac port if all data samples are read in and
processed.
2. In the simulator,
a. Go to the program memory (disassembled) window,
d. Go to symbol output, set to break on 4097th occurence, and
c. Run
——————————————————————————————————————*/
#include “def21020.h”
#define N_1 32 /* number of taps, FIR #1 */
#define N_2 32 /* number of taps, FIR #2 */
#define L_1 4 /* interpolate by factor of L_1 */
#define L_2 2 /* interpolate by factor of L_2 */
#define CNT1 r2
#define CNT2 r3
153
5 Multirate Filters
.VAR data1[N_1/L_1]; /* ADC fixed-point data buffer #1 */
.VAR data2[N_2/L_2]; /* ADC fixed-point data buffer #2 */
.ENDSEG;
/*———————————————————————————————————————
RESET Service Routine
———————————————————————————————————————*/
.SEGMENT /pm rst_svc;
rst_svc: PMWAIT=0x0021; /* no wait states,internal ack only */
DMWAIT=0x8421; /* no wait states,internal ack only */
jump init_int; /* initialize the test case */
.ENDSEG;
/*———————————————————————————————————————
TMZH Service Routine
———————————————————————————————————————*/
.SEGMENT /pm tmzh_svc;
/* sample input: this interrupt occurs at the L_1*L_2*input rate */
tmzh_svc: CNT2=CNT2-1,modify(i9,m15); /* decrement counter, and */
/* shift coef #2 pointer back */
if ne jump do_fir2; /* do second FIR if CNT2!=0 */
.ENDSEG;
/*———————————————————————————————————————
Initialize Interpolation Routine
———————————————————————————————————————*/
.SEGMENT /pm pm_code;
init_int: /* initialize buffer index registers */
b0=data1; l0=@data1; m0=1; /* data buffer #1 */
b1=data2; l1=@data2; /* data buffer #2 */
b8=coef1; l8=@coef1; m8=L_1; /* modifier for coef1 = L_1 */
b9=coef2; l9=@coef2; m9=L_2; /* modifier for coef2 = L_2 */
m15=-1;
/* Register Initialization */
r1=L_1*L_2; /* value for upscaling sample */
f1=float r1; /* fix —> float conversion */
154
Multirate Filters 5
CNT1=1; /* set interpolate cntrs to 1 */
CNT2=1; /* for first data sample */
r13=1; /* counter compare reg */
r14=16; /* exponent scale factor */
wait_interrupt: idle;
jump wait_interrupt; /* infinite wait loop */
.ENDSEG;
/*———————————————————————————————————————
Two-stage Interpolate
code executes at L_1*L_2*(sample_rate)
———————————————————————————————————————*/
.SEGMENT /pm pm_code;
int2stg: /* filter pass, occurs at L_1 times the input sample rate */
f0=dm(i0,m0), f4=pm(i8,m8);
f8=f0*f4, f0=dm(i0,m0), f4=pm(i8,m8);
f12=f0*f4, f0=dm(i0,m0), f4=pm(i8,m8);
lcntr=N_1/L_1-3, do taps1 until lce;
taps1: f12=f0*f4, f8=f8+f12, f0=dm(i0,m0), f4=pm(i8,m8);
f12=f0*f4, f8=f8+f12;
f0=f8+f12;
dm(i1,m0)=f0;
modify(i9,m9);
CNT2=m9; /* reset counter #2 to L_2 */
155
5 Multirate Filters
f12=f0*f4, f8=f8+f12;
f0=f8+f12;
dm(i1,m0)=f0;
rti (db);
r0 = fix f0 by r14; /* float -> fixed
*/
output: dm(dac)=r0; /* output data sample
*/
.ENDSEG;
/*———————————————————————————————————————
Acquire Sample
———————————————————————————————————————*/
.SEGMENT /pm pm_code;
/* sample input: this code occurs at the input rate
*/
sample: /* do fix->float, and scale data up by L_1*L_2
*/
f0=float r15, modify(i8,m8);
f0=f0*f1; /* upscale sample
*/
jump int2stg (db);
dm(i0,m0)=f0; /* update delay line w/latest
*/
CNT1=m8; /* reset counter #1 to L_1
*/
.ENDSEG;
5.7 REFERENCES
[ADI90] Analog Devices Inc., Amy Mar, ed. 1990. Digital Signal
Processing Applications Using The ADSP-2100 Family.
Englewood Cliffs, NJ: Prentice Hall.
156
Adaptive Filters 6
6.1 INTRODUCTION
Fixed-frequency-response digital filters were discussed in the two
previous chapters. This chapter looks at filters with a frequency response,
or transfer function, that can change over time to match desired system
characteristics.
157
6 Adaptive Filters
6.1.1.1 System Identification
You can design controls for a dynamic system if you have a model that
describes the system in motion. Modeling is not easy with complex
physical phenomena, however. You can get information about the system
to be controlled from collecting experimental data of system responses to
given excitations. This process of constructing models and estimating the
best values of unknown parameters from experimental data is called
system identification.
Figure 6.1 shows a block diagram of the system identification model. The
unknown system is modeled by an FIR filter with adjustable coefficients.
Both the unknown time-variant system and FIR filter model are excited by
an input sequence u(n). The adaptive FIR filter output y(n) is compared
with the unknown system output d(n) to produce an estimation error e(n).
The estimation error represents the difference between the unknown
system output and the model (estimated) output. The estimation error e(n)
is then used as the input to an adaptive control algorithm which corrects
the individual tap weights of the filter. This process is repeated through
several iterations until the estimation error e(n) becomes sufficiently small
in some statistical sense. The resultant FIR filter response now represents
that of the previously unknown system.
d(n)
Unknown
System
+
u(n) e(n)
∑
-
Adaptive
Algorithm
158
Adaptive Filters 6
6.1.1.2 Adaptive Equalization For Data Transmission
Adaptive filters are used widely to provide equalization in data modems
that transmit data over speech-band and wider bandwidth channels. An
adaptive equalizer is employed to compensate for the distortion caused by
the transmission medium. Its operation involves a training mode followed
by a tracking mode.
159
6 Adaptive Filters
6.1.1.4 Linear Predictive Coding of Speech Signals
The method of linear predictive coding (LPC) is an example of a source
coding algorithm used for the digital representation of speech signals. So
called source coders are model dependent: they use previous knowledge
of how the speech signal was generated at its source. Source coders for
speech are generally referred to as vocoders and can operate at data rates of
4.8 Kbits/s or below. In LPC, the source vocal tract is modeled as a linear
all-pole filter whose parameters are determined adaptively from speech
samples by means of linear prediction. The speech samples u(n) are, in this
case, the desired response, while u(n-1) forms the inputs to the adaptive
FIR filter known as a prediction error filter. The error signal between u(n)
and the output of the FIR filter, y(n), is then minimized in the least-squares
sense to estimate the model parameters. The error signal and the model
parameters are encoded into a binary sequence and transmitted to the
destination. At the receiver, the speech signal is synthesized from the
model parameters and the error signal.
160
Adaptive Filters 6
6.1.2.1 Transversal Structure
Figure 6.2 shows the structure of a transversal FIR filter with N tap
weights (adjustable during the adaptation process) with values at time n
denoted as
N-1
y(n) = wT(n) u(n) = ∑ wi(n) u(n-i)
i=0
where T denotes transpose, n is the time index, and N is the order of the
filter.
w0 w1 • • • w N-2 w N-1
∑ • • • ∑ ∑ y(n)
161
6 Adaptive Filters
6.1.2.2 Symmetric Transversal Structure
The characteristic of linear phase response in a filter is sometimes
desirable because it allows a system to reject or shape energy bands of the
spectrum and still maintain the basic pulse integrity with a constant filter
group delay. Imaging and digital communications are examples of
applications where this characteristic is desirable.
N/2
y(n) = ∑ wi(n)[u(n – i) + u(n – N + 1 + i)]
i=0
N
u(n-1) u(u- +1)
2
u(n) z -1 z -1 • • • z -1
∑ ∑ ∑ ∑ z -1
z -1 z -1 • • • z -1
N
u(n-N+1) u(n-N+2) u(n- )2
w w1 w N-2 w N-1
0 2 2
∑ • • • ∑ ∑ y(n)
162
Adaptive Filters 6
6.1.2.3 Lattice Structure
The lattice filter has a modular structure with cascaded identical stages.
Figure 6.4 shows one stage of a lattice FIR structure.
+
f m-1 (n) ∑ f m (n)
-
k m(n)
-
k m(n)
-1
b m-1 (n) Z ∑ bm (n)
bm-1 (n-1) +
163
6 Adaptive Filters
The following equations represent the dynamics of the mth stage of an
order M lattice structure as derived from Figure 6.4:
where fm(n) represents the forward prediction error, bm(n) represents the
backward prediction error, Km(n) is the reflection coefficient, m is the stage
index, and M is the number of cascaded stages. Km(n) has a magnitude less
than one. The terms fm(n) and bm(n) are initialized as
Speech analysis is usually performed by using the lattice structure and the
reflection coefficients Km(n). Since the dynamic range of Km(n) is
significantly smaller than that of the tap weights, w(n), of a transversal
filter, the reflection coefficients require fewer bits to represent them.
Hence, Km(n) are transmitted over the channel.
where w(n) represents the tap weights of the transversal filter, e(n) is the
error signal, u(n) represents the tap inputs, and the factor µ is the
adaptation parameter or step-size. To ensure convergence, µ must satisfy
the condition
164
Adaptive Filters 6
where the total input power refers to the sum of the mean-square values of
the tap inputs u(n), u(n-1), ..., u(n-N+1). Moreover, the LMS convergence
time depends on the ratio of maximum to minimum eigenvalues of the
autocorrelation matrix R of the input signal.
x
mu =
uT(n). u(n))
xe(n)u(n)
w(n+1) = w(n) +
r+uT(n)u(n)
A problem can occur when the autocorrelation matrix associated with the
input process has one or more zero eigenvalues. In this case, the adaptive
filter will not converge to a unique solution. In addition, some uncoupled
coefficients (weights) may grow without bound until hardware overflow
or underflow occurs. This problem can be remedied by using coefficient
leakage. This “leaky” LMS algorithm can be written as
where the adaptation constant µ and the leakage coefficient r are a small
positive values.
165
6 Adaptive Filters
For faster rates of convergence, more complex algorithms with additional
parameters must be used. The RLS algorithm uses a least-squares method
to estimate correlation directly from the input data. The LMS algorithm
uses the statistical mean-squared-error method, which is slower.
The RLS algorithm uses a transversal FIR filter implementation. The order
of operations the algorithm takes is
The tap weight update is based on the error signal and the Kalman Gain
Vector and is expressed as
where K is the Kalman Gain Vector and e(n) represents the error signal.
166
Adaptive Filters 6
6.2 IMPLEMENTATIONS
The adaptive filter routines have two inputs
• input sample
• desired response
• filter output
• filter weights.
u(n) y(n) ^
y(n)
Adaptive Decision
Equalizer Device
2
+ - d(n)
∑ Desired
Adaptive 1 Signal
Algorithm
e(n) Position 1: Training Mode
Position 2: Tracking Mode
167
6 Adaptive Filters
6.2.1 Transversal Filter Implementation
The transversal and symmetric transversal FIR filter structures require the
use of a delay line of input samples u(n). The samples u(n), u(n-1), ...,
u(n-N+1) are stored in a circular Data Memory buffer in a reverse order.
The filter weights w0, w1,...,wN – 1 are placed in a circular Program Memory
Data buffer in forward order. Note that u(n + 1) replaces u(n– N + 1) at
time n + 1; therefore, the index of the delay line buffer must point to u(n -
N + 1) before the next filter iteration.
The routine accepts two inputs, the input sample, u(n), and the desired
output, d(n).
N-1
y(n) = ∑ wi(n) u(n – i)
i=0
168
Adaptive Filters 6
6.2.2.1 Code Listing—lms.asm
/
*********************************************************************************
File Name
LMS.ASM
Version
April 2 1991
Purpose
Performs LMS algorithm implemented with a transversal FIR filter structure.
Equations Implemented
******************************************************************
* 1) y(n)= w.u ( . = dot_product), y(n)= FIR filter output *
* where w= [w0(n) w1(n) ... wN-1(n)]= filter weights *
* and u= [u(n) u(n-1) ... u(n-N+1)]= input samples in delay line*
* n= time index, N= number of filter weights (taps) *
* 2) e(n)= d(n)-y(n), e(n)= error signal & d(n)= desired output *
* 3) wi(n+1)= wi(n)+STEPSIZE*e(n)*u(n-i), 0 =<i<= N-1 *
******************************************************************
Calling Parameters
f0= u(n) = input sample
f1= d(n) = desired output
Return Values
f13= y(n)= filter output
f6= e(n)= filter error signal
i8 -> Program Memory Data buffer of the filter weights
Registers Affected
f0, f1, f4, f6, f7, f8, f12, f13
Cycle Count
lms_alg: 3N+8 per iteration, lms_init: 12+N
# PM Locations
pm code= 29 words, pm data= N words
# DM Locations
dm data= N words
*********************************************************************************/
#define TAPS 5
#define STEPSIZE 0.005
.SEGMENT/DM dm_data;
.VAR deline_data[TAPS];
.ENDSEG; (listing continues on next page)
169
6 Adaptive Filters
.SEGMENT/PM pm_data;
.VAR weights[TAPS];
.ENDSEG;
.SEGMENT/PM pm_code;
lms_init: b0=deline_data;
m0=-1;
l0=TAPS; /* circular delay line buffer */
b8=weights;
b9=b8;
m8=1;
l8=TAPS; /* circular weight buffer */
l9=l8;
f7=STEPSIZE;
f0=0.0;
lcntr=TAPS, do clear_bufs until lce;
clear_bufs: dm(i0,m0)=f0, pm(i8,m8)=f0;
rts; /* clear delay line & weights */
.ENDSEG;
170
Adaptive Filters 6
6.2.3 llms.asm—Leaky LMS Algorithm (Transversal)
Implementation of this algorithm is similar to the standard LMS, except
the tap weight update calculation takes into account the constant
LEAK_COEF , such that
File Name
LLMS.ASM
Version
April 2 1991
Purpose
Performs the “leaky” LMS algorithm implemented with a
transversal FIR filter structure
Equations Implemented
*****************************************************************
* 1) y(n)= w.u ( . = dot_product), y(n)= FIR filter output *
* where w= [w0(n) w1(n) ... wN-1(n)]= filter weights *
* and u= [u(n) u(n-1) ... u(n-N+1)]= input samples in delay line*
* n= time index, N= number of filter weights (taps) *
* 2) e(n)= d(n)-y(n), e(n)= error signal & d(n)= desired output *
* 3) wi(n+1)= LEAK_COEF*wi(n)+STEPSIZE*e(n)*u(n-i), 0 =<i<= N-1 *
*****************************************************************
Calling Parameters
f0= u(n)= input sample
f1= d(n)= desired output
Return Values
f13= y(n)= filter output
f6= e(n)= filter error signal
f8 -> Program Memory Buffer of the filter weights
Registers Affected
f0, f1, f2, f4, f6, f7, f8, f9, f12, f13
Cycle Count
llms_alg: 3N+8 per iteration, llms_init: 13+N
# PM Locations
pm code= 30 words, pm data= N words
# DM Locations
dm data= N words (listing continues on next page)
171
6 Adaptive Filters
*********************************************************************************/
#define TAPS 5
#define STEPSIZE 0.0045
#define LEAK_COEF 0.9995
.SEGMENT/DM dm_data;
.VAR deline_data[TAPS];
.ENDSEG;
.SEGMENT/PM pm_data;
.VAR weights[TAPS];
.ENDSEG;
.SEGMENT/PM pm_code;
llms_init: b0=deline_data;
m0=-1;
l0=TAPS; /* circular delay line buffer */
b8=weights;
b9=b8;
m8=1;
l8=TAPS; /* circular weight buffer */
l9=l8;
f7=STEPSIZE;
f2=LEAK_COEF;
f0=0.0;
172
Adaptive Filters 6
lcntr=TAPS-1, do update_weights until lce;
f9=f0*f4, f8=f9+f12, f4=dm(i0,m0), f12=pm(i8,m8); /*
f9=STEPSIZE*e(n)*u(n-i-1), f8=wi(n+1),
f4=u(n-i-2),f12=wi+1(n) */
update_weights: f12=f2*f12, pm(i9,m8)=f8;
/* f12= LEAK_COEF*wi+1(n), store wi(n+1) */ rts(db);
f8=f9+f12, f0=dm(i0,2);
/* f8= wN-1(n+1) i0 -> u(n+1) location in delay line */
pm(i9,m8)=f8; /* store wN-1(n+1) */
.ENDSEG;
173
6 Adaptive Filters
File Name
NLMS.ASM
Version
April 2 1991
Purpose
Performs the normalized LMS algorithm implemented
with a transversal FIR filter structure.
Equations Implemented
*****************************************************************
* 1) y(n)= w.u ( . = dot_product), y(n)= FIR filter output *
* where w= [w0(n) w1(n) ... wN-1(n)]= filter weights *
* and u=[ u(n) u(n-1) ... u(n-N+1)]= input samples in delay line*
* n= time index, N= number of filter weights (taps) *
* 2) e(n)= d(n)-y(n), e(n)= error signal & d(n)= desired output *
* 3) wi(n+1)= wi(n)+normalized_stepsize*e(n)*u(n-i), 0 =<i<= N-1*
* 4) normalized_stepsize= ALPHA/(GAMMA+E(n)) *
* where E(n)= u.u = energy in delay line *
* E(n) is computed recursively as follows *
* 5) E(n)= E(n-1)+u(n)**2-u(n-N)**2 *
*****************************************************************
Calling Parameters
f0= u(n) = input sample
f1= d(n) = desired output
Return Values
f2= y(n)= filter output
f6= e(n)= filter error signal
i8 -> Program Memory Data buffer of the filter weights
Registers Affected
f0, f1, f2, f4, f5, f6, f7, f8, f9, f11, f12, f13, f14
Cycle Count
nlms_alg: 3N+16 per iteration, nlms_init: 14+N
# PM Locations
pm code= 32 words, pm data= N words
# DM Locations
dm data= N words
*********************************************************************************/
174
Adaptive Filters 6
#define TAPS 5
#define ALPHA 0.1
#define GAMMA 0.1
#include “a:\global\macros.h”
.SEGMENT/DM dm_data;
.VAR deline_data[TAPS];
.ENDSEG;
.SEGMENT/PM pm_data;
.VAR weights[TAPS];
.ENDSEG;
.SEGMENT/PM pm_code;
nlms_init: b0=deline_data;
m0=-1;
l0=TAPS; /* circular delay line buffer */
b8=weights;
b9=weights;
m8=1;
l8=TAPS; /* circular weight buffer */
l9=TAPS;
f5=ALPHA;
f11=GAMMA; /* f11= E(0)= GAMMA */
f9= 2.0; /* f9= 2.0 for DIVIDE_ macro */
f13=0.0;
lcntr=TAPS, do clear_bufs until lce;
clear_bufs: dm(i0,m0)=f13, pm(i8,m8)=f13;
/* clear delay line & weights */
rts;
175
6 Adaptive Filters
f12=f0*f4, f8=f8+f12, f14=f11; /* f12= u(n-N+1)*wN-1(n),
f14= E(n) */
f2=f8+f12; /* f2= y(n) */
f6=f1-f2, f4=dm(i0,m0); /* f6= e(n), f4= u(n) */
f7=f6*f5; /* f7= ALPHA*e(n) */
DIVIDE(f1,f7,f14,f9,f0); /* f1= normalized_stepsize*e(n) */
f0=f1*f4, f12=pm(i8,m8); /* f0= f1*u(n), f12= w0(n) */
.ENDSEG;
For the sign-error algorithm, taps are updated according to the relationship
where
sgn{x} = 1 for x ≥ 0
sgn{x} = –1 for x < 0
This function is implemented after the main filter calculation loop, macs , and prior
to the update_weights loop, as the error signal is used across all weights i, for
any given time, n.
176
Adaptive Filters 6
File Name
SELMS.ASM
Version
April 2 1991
Purpose
Performs the sign-error LMS algorithm implemented
with a transversal FIR filter structure
Equations Implemented
********************************************************************
* 1) y(n)= w.u ( . = dot_product) , y(n)= FIR filter output *
* where w= [w0(n) w1(n) ... wN-1(n)]= filter weights *
* and u= [u(n) u(n-1) ... u(n-N+1)]= input samples in delay line *
* n= time index, N= number of filter weights (taps) *
* 2) e(n)= d(n)-y(n), e(n)= error signal & d(n)= desired output *
* 3) wi(n+1)= wi(n)+STEPSIZE*sgn[e(n)]*u(n-i), 0 =<i<= N-1 *
* where sgn[e(n)]= +1 if e(n) >= 0 and -1 if e(n) < 0 *
********************************************************************
Calling Parameters
f0= u(n)= input sample
f1= d(n)= desired output
Return Values
f13= y(n)= filter output
f1= e(n)= filter error signal
i8 -> Program Memory Data buffer of the filter weights
Registers Affected
f0, f1, f2, f4, f7, f8, f12, f13
Cycle Count
selms_alg: 3N+8 per iteration, selms_init: 12+N
# PM Locations
pm code: 29 words, pm data= N words
# DM Locations
dm data= N words
177
6 Adaptive Filters
*********************************************************************************/
#define TAPS 5
#define STEPSIZE 0.005
.GLOBAL selms_init;
.GLOBAL selms_alg;
.SEGMENT/DM dm_data;
.VAR deline_data[TAPS];
.ENDSEG;
.SEGMENT/PM pm_data;
.VAR weights[TAPS];
.ENDSEG;
.SEGMENT/PM pm_code;
selms_init: b0=deline_data;
m0=-1;
l0=TAPS; /* circular delay line buffer */
b8=weights;
b9=b8;
m8=1;
l8=TAPS; /* circular weight buffer */
l9=l8;
f7=STEPSIZE;
f0=0.0;
178
Adaptive Filters 6
lcntr=TAPS-3, do macs until lce;
macs: f12=f0*f4, f8=f8+f12, f0=dm(i0,m0), f4=pm(i8,m8);
rts(db);
f8=f0+f12, f0=dm(i0,1); /* f8= wN-1(n+1) */
/* i0 -> u(n+1) location in delay line */
pm(i9,m8)=f8; /* store wN-1(n+1) */
.ENDSEG;
179
6 Adaptive Filters
previous input values have influence across the taps, i, for a given time, n.
File Name
SDLMS.ASM
Version
April 2 1991
Purpose
Performs the sign-data LMS algorithm implemented with
a transversal FIR filter structure
Equations Implemented
*****************************************************************
* 1) y(n)= w.u ( . = dot_product), y(n)= FIR filter output *
* where w= [w0(n) w1(n) ... wN-1(n)]= filter weights *
* and u= [u(n) u(n-1) ... u(n-N+1)]= input samples in delay line*
* n= time index, N= number of filter weights (taps) *
* 2) e(n)= d(n)-y(n), e(n)= error signal & d(n)= desired output *
* 3) wi(n+1)= wi(n)+STEPSIZE*e(n)*sgn[u(n-i)], 0 =<i<= N-1 *
* where sgn[u(n)]= +1 if u(n) >= 0 and -1 if u(n) < 0 *
*****************************************************************
Calling Parameters
f0= u(n)= input sample
f1= d(n)= desired output
Return Values
f13= y(n)= filter output
f6= e(n)= filter error signal
i8 -> Program Memory Data buffer of the filter weights
Registers Affected
f0, f1, f4, f5, f6, f7, f8, f9, f12, f13
Cycle Count
sdlms_alg: 4N+8 per iteration, sdlms_init: 13+N
# PM Locations
pm code= 32 words, pm data= N words
# DM Locations
180
Adaptive Filters 6
dm data= N words
*********************************************************************************/
#define TAPS 5
#define STEPSIZE 0.005
.SEGMENT/DM dm_data;
.VAR deline_data[TAPS];
.ENDSEG;
.SEGMENT/PM pm_data;
.VAR weights[TAPS];
.ENDSEG;
.SEGMENT/PM pm_code;
sdlms_init: b0=deline_data;
m0=-1;
l0=TAPS; /* circular delay line buffer */
b8=weights;
b9=b8;
m8=1;
l8=TAPS; /* circular weight buffer */
l9=l8;
f7=STEPSIZE;
f5=1.0;
f0=0.0;
lcntr=TAPS, do clear_bufs until lce;
clear_bufs: dm(i0,m0)=f0, pm(i8,m8)=f0;
/* clear delay line & weights */
rts;
181
6 Adaptive Filters
lcntr=TAPS-3, do macs until lce;
macs: f12=f0*f4, f8=f8+f12, f0=dm(i0,m0), f4=pm(i8,m8);
/* f12= u(n-i)*wi(n), f8= sum of prod, f0= u(n-i-1), f4= wi+1(n)
*/
if lt f9=-f1;
rts(db);
f8=f9+f12, f0=dm(i0,2);
182
Adaptive Filters 6
/* f8= wN-1(n+1) i0 -> u(n+1) location in delay line
*/
pm(i9,m8)=f8; /* store wN-1(n+1) */
.ENDSEG;
The error value, e(n), and the input value, u(n), are multiplied together in
the update_weights loop in order to set the appropriate multiplier sign
flags. The resultant flags determine whether the STEPSIZE value is
added or subtracted from the past tap weight.
File Name
SSLMS.ASM
Version
April 2 1991
Purpose
Performs the sign-sign LMS algorithm implemented with
a transversal FIR filter structure
Equations Implemented
*****************************************************************
* 1) y(n)= w.u ( . = dot_product), y(n)= FIR filter output *
* where w= [w0(n) w1(n) ... wN-1(n)]= filter weights *
* and u= [u(n) u(n-1) ... u(n-N+1)]= input samples in delay line*
* n= time index, N= number of filter weights (taps) *
* 2) e(n)= d(n)-y(n), e(n)= error signal & d(n)= desired output *
* 3) wi(n+1)= wi(n)+STEPSIZE*sgn[e(n)]*sgn[u(n-i)], 0 =<i<= N-1 *
* where sgn[x]= +1 if x >= 0 and -1 if x < 0 *
*****************************************************************
Calling Parameters
f0= u(n)= input sample
183
6 Adaptive Filters
f1= d(n)= desired output
Return Values
f13= y(n)= filter output
f1= e(n)= filter error signal
i8 -> Program Memory Data buffer of the filter weights
Registers Affected
f0, f1, f2, f4, f7, f8, f9, f12, f13
Cycle Count
sslms_alg: 4N+7 per iteration, sslms_init: 13+N
# PM Locations
pm code= 31 words, pm data= N words
# DM Locations
dm data= N words
*********************************************************************************/
#define TAPS 5
#define STEPSIZE 0.005
.SEGMENT/DM dm_data;
.VAR deline_data[TAPS];
.ENDSEG;
.SEGMENT/PM pm_data;
.VAR weights[TAPS];
.ENDSEG;
.SEGMENT/PM pm_code;
sslms_init: b0=deline_data;
m0=-1;
l0=TAPS; /* circular delay line buffer */
b8=weights;
b9=b8;
m8=1;
l8=TAPS; /* circular weight buffer */
l9=l8;
f7=STEPSIZE;
f2=1.0;
f0=0.0;
184
Adaptive Filters 6
/* f4=w0(n), store u(n) in delay line */
f8=f0*f4, f0=dm(i0,m0), f4=pm(i8,m8);
/* f8= u(n)*w0(n), f0= u(n-1), f4= w1(n) */
f12=f0*f4, f0=dm(i0,m0), f4=pm(i8,m8);
/* f12= u(n-1)*w1(n), f0= u(n-2), f4= w2(n)
*/
.ENDSEG;
N–1
185
6 Adaptive Filters
y(n) = ∑ wi(n) * [u(n – i) + u(n – M + i + 1)].
i=0
The filter calculation is broken up into two loops. The macs loop is
similar to the one in lms.asm , but this algorithm adds a second
calculation loop, macs1 , to account for the second term in the
sum-of-products. The update_weights loop has an added multiply to
account for the extra term in the weight update equation
File Name
SYLMS.ASM
Version
April 2 1991
Purpose
Even order symmetric transversal filter structure implementation of the LMS
algorithm
Equations Implemented
*****************************************************************
* 1) y(n)= SUM[wi(n)*[u(n-i)+u(n-M+i+1)]] for 0 =<i<= N-1 *
* where y(n)= FIR filter output, wi= filter weights *
* u= input samples in delay line, N= number of weights (taps),*
* n= time index, and M= 2N= length of delay line *
* 2) e(n)= d(n)-y(n), e(n)= error signal & d(n)= desired output *
* 3) wi(n+1)= wi(n)+STEPSIZE*e(n)*[u(n-i)+u(n-M+i+1)], 0=<i<=N-1*
*****************************************************************
Calling Parameters
Calling parameters (inputs):
f0= u(n)= input sample
f1= d(n)= desired output
Return Values
f13= y(n)= filter output
f6= e(n)= filter error signal
186
Adaptive Filters 6
i8 -> Program Memory Data buffer of the filter weights
Registers Affected
f0, f1, f3, f4, f6, f7, f8, f9, f12, f13
Cycle Count
sylms_alg: 2M+10 per iteration, sylms_init: 17+M
# PM Locations
pm code= 39 words, pm data= N words
# DM Locations
dm data= 2N
*********************************************************************************/
#define TAPS 4
#define STEPSIZE 0.005
.SEGMENT/DM dm_data;
.VAR deline_data[2*TAPS];
.ENDSEG;
.SEGMENT/PM pm_data;
.VAR weights[TAPS];
.ENDSEG;
.SEGMENT/PM pm_code;
sylms_init: b0=deline_data;
b3=deline_data;
m0=-1;
m1=1;
m2=TAPS+2;
l0=2*TAPS; /* circular delay line buffer of length M= 2N */
l3=2*TAPS;
b8=weights;
b9=weights;
m8=1;
m9=-1;
m10=-2;
l8=TAPS; /* circular weight buffer */
l9=TAPS;
f7=STEPSIZE;
f0=0.0;
lcntr=2*TAPS, do clear_bufs until lce;
clear_bufs: dm(i0,m0)=f0, pm(i8,m8)=f0;
/* clear delay line & weights */
187
6 Adaptive Filters
rts;
188
Adaptive Filters 6
lcntr=TAPS-1, do update_weights until lce;
f9=f1*f4, f8=f9+f12, f0=dm(i0,m0), f12=pm(i8,m8)
/* f9= STEPSIZE*e(n)*[u(n-i-1)+u(n-M+2+i)], */
/* f8= wi(n+1), f0= u(n-2-i), f12= wi+1(n) */
update_weights: f4=f0+f3, f3=dm(i3,m1), pm(i9,m8)=f8;
/* f4=u(n-2-i)+u(n-M+3+i), f3=u(n-M+4+i), store
wi(n+1) */
rts(db);
f8=f9+f12, f0=dm(i0,m2);
/* i0 ->u(n+1) location in delay line */
pm(i9,m8)=f8;
.ENDSEG;
where 0 < m < M. The tap coefficients are updated using the relationship
189
6 Adaptive Filters
where 0 < m ≤ M.
190
Adaptive Filters 6
m=0
The gradient lattice algorithm provides significantly faster convergence than the
LMS algorithm. Unlike the LMS algorithm, the convergence rate of the gradient
lattice algorithm does not depend on the eigenvalue spread of the autocorrelation
matrix.
In addition to the filter weight, filter error, and filter output, this routine also
provides reflection coefficient, forward prediction error, and backward prediction
error as outputs. The lattice algorithm is implemented in the update_coefs loop.
This loop is STAGES long, where STAGES is a constant set to the length of the
lattice structure (STAGES = 3 in this example).
File Name
LATLMS.ASM
Version
April 9 1991
Purpose
Performs the LMS algorithm implemented with a lattice FIR filter structure
with Joint Process estimation
Equations Implemented
*****************************************************************
* 1) f0(n)= b0(n)= u(n) *
* 2) fm(n)= fm-1(n) - km(n)*bm-1(n-1), 0 <m<= M *
* 3) bm(n)= bm-1(n-1) - km(n)*fm-1(n), 0 <m<= M *
* 4) km(n+1)= km(n) + STEPSIZE*[fm(n)*bm-1(n-1) + bm(n)*fm-1(n)],*
* 0 <m<= M *
* where u(n)= input sample, n= time index, M= Number of stages *
* fm(n)= forward prediction error of the mth stage *
* bm(n)= backward prediction error of the mth stage *
* km(n)= reflection coefficient of the mth stage *
* *
* 5) e0(n)= d(n) - b0(n)*g0(n) *
* 6) em(n)= em-1(n) - bm(n)*gm(n), 0 <m<= M *
* 7) gm(n+1)= gm(n) + STEPSIZE*em(n)*bm(n), 0 <=m<= M *
* 8) y(n)= SUM [gm(n)*bm(n)] for 0 <=m<= M *
* where d(n)= desired output, y(n)= FIR filter output *
* em(n)= output error at the mth stage *
191
6 Adaptive Filters
* gm(n)= filter weight at the mth stage *
******************************************************************
Calling Parameters
f0= u(n)= input sample
f9= d(n)= desired output
Return Values
f0= bm(n)= mth backward prediction error
f2= fm(n)= mth forward prediction error
f6= em(n)= mth output error
f12= y(n)= filter output
i8 -> Program Memory Data buffer of the reflection coefficients
i10 -> Program Memory Data buffer of the filter weights
Registers Affected
f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f12, f13
Cycle Count
LATLMS_ALG: 7+11M per iteration , LATLMS_INIT: 16+2M
# PM Locations
pm code= 36 words, pm data= 2M+1
# DM Locations
dm data= M+1
*********************************************************************************/
#define STAGES 3
#define STEPSIZE 0.005
.SEGMENT/DM dm_data;
.VAR bpe_coef[STAGES+1];
.ENDSEG;
.SEGMENT/PM pm_data;
.VAR weights[STAGES+1];
.VAR ref_coef[STAGES];
.ENDSEG;
.SEGMENT/PM pm_code;
192
Adaptive Filters 6
latlms_init: b0=bpe_coef;
m0=0;
m1=1;
l0=STAGES+1;
b8=ref_coef;
b10=weights;
m8=0;
m9=1;
l8=STAGES;
l10=STAGES+1;
f7=STEPSIZE;
f0=0.0;
lcntr=STAGES, do clear_bufs until lce;
dm(i0,m1)=f0, pm(i8,m9)=f0;
clear_bufs: pm(i10,m9)=f0;
rts(db);
dm(i0,m1)=f0, pm(i10,m9)=f0;
nop;
193
6 Adaptive Filters
/* f12= y(n) of mth stage, f5= km+1(n) */
f13=f1*f7, f6=f10-f13, f1=f9;
/* f6= em(n), f13= stepsize*f1, f1= bm(n-1) */
f13=f1*f5, f8=f8+f13, f10=f2;
/* f13= bm(n-1)*km+1(n), f8= km(n+1), f10= fm(n) */
update_coefs: f3= f0*f6, pm(i8,2)=f8;
/* f3= bm(n)*em(n), store km(n+1) */
rts(db), f3=f3*f7;
/* f3= stepsize*em(n)*bm(n) */
f4=f3+f4, f3=pm(i8,-1);
/* f4= gm(n+1), i8 -> k1(n+1) */
pm(i10,m9)=f4;
/* store gm(n+1) */
.ENDSEG;
194
Adaptive Filters 6
The routine calculates the transversal filter output, y(n), much like the lms.asm
routine. The Kalman Gain Vector is computed in two steps. The first step computes
an intermediate value, x. Since x deals with the N-by-N matrix of input
autocorrelation data, it is calculated using a nested loop. The second step takes x and
performs a dot product with the input vector, u, in the mac3 loop, to create k0(n),
the first element in the vector. The subsequent vector values, ki(n), along with the tap
weight vector, wi(n), are computed in the loop comp_kn_wn . Finally, the
autocorrelation matrix, Z, is updated in a nested loop, where the inner loop updates
the rows, and the outer loop updates the columns.
During initialization, the Z-matrix is set up using the forgetting factor to create the
time decay structure. The forgetting factor is set as a constant, FORGET_FACT .
File Name
RLS.ASM
Version
April 18 1991
Purpose
Performs the Recursive Least-Squares (RLS) algorithm
implemented with a transversal FIR filter structure
Equations Implemented
*********************************************************************
* 1) x= s*Z(n-1).u , where s= 1/forgetting factor, n= time index *
* u= [u(n) u(n-1) ... u(n-N+1)]= input samples in delay line *
* Z is an N-by-N matrix, N= number of filter weights *
* 2) k= x/[1+u.x] where k is an N-by-1 vector *
* 3) Z(n)= s*Z(n-1) - k.xT , xT is the transpose of vector x *
* 4) e(n)= d(n) - w(N-1).u , e(n)= filter “a priori” error signal *
* w= [w0 w1 ... wN-1]= filter weights, d(n)= desired output *
* 5) w(n)= w(n-1) + k*e(n) *
*********************************************************************
Calling Parameters
f0, f1, f2, f3, f4, f5, f8, f9, f10, f11, f12, f13, f14
Return Values
195
6 Adaptive Filters
f13= y(n)= filter output
f1= e(n)= filter “a priori” error signal
i0 -> Data Memory Data buffer of the filter weights
Registers Affected
f0, f1, f2, f4, f7, f8, f9, f12, f13
Cycle Count
rls_alg: 3N**2+9N+20 per iteration, rls_init: N**2+3N+25
# PM Locations
pm code= 81 words, pm data= 2N words
# DM Locations
dm data= N**2+2N
*********************************************************************************/
#define TAPS 5
#define FORGET_FACT 0.9
#define INIT_FACT 1000.
#include “b:\global\macros.h”
.SEGMENT/DM dm_data;
.VAR weights[TAPS];
.VAR zmatrix[TAPS*TAPS];
.VAR xvector[TAPS];
.ENDSEG;
.SEGMENT/PM pm_data;
.VAR deline_data[TAPS];
.VAR kvector[TAPS];
.ENDSEG;
.SEGMENT/PM pm_code;
rls_init: b0=weights;
l0=TAPS;
b1=zmatrix;
l1=TAPS*TAPS;
b3=b1;
l3=l1;
b2=xvector;
l2=TAPS;
m0=1;
m2=0;
m3=-3;
b8=deline_data;
l8=TAPS;
b9=kvector;
l9=TAPS;
196
Adaptive Filters 6
m8=-1;
m9=1;
m11=3;
f10=2.0;
f14=1.0;
f5=1/FORGET_FACT;
f0=0.0;
f1=INIT_FACT;
lcntr=TAPS, do clear_bufs until lce;
dm(i0,m0)=f0, pm(i8,m9)=f0;
clear_bufs: dm(i2,m0)=f0, pm(i9,m9)=f0;
dm(i1,m0)=f1;
lcntr=TAPS-1, do init_zmatrix until lce;
lcntr=TAPS, do clear_zel until lce;
clear_zel: dm(i1,m0)=f0;
init_zmatrix: dm(i1,m0)=f1;
rts;
f4=dm(i1,m3), f0=pm(i8,m11);
(listing continues on next page)
197
6 Adaptive Filters
/* i1 -> Z0(0,n-1), i8 -> u(n) */
f4=dm(i2,m0), f0=pm(i8,m8);
/* f4= x0(n), f0= u(n) */
f8=f0*f4, f4=dm(i2,m0), f0=pm(i8,m8);
/* f8=x0(n)*u(n), f4=x1(n), f0=u(n-1) */
f12=f0*f4, f8=f8+f14, f4=dm(i2,m0), f0=pm(i8,m8);
/* f12=x1(n)*u(n-1), f8=1+x0(n)*u(n),f4=x2(n), f0=u(n-
2) */
198
Adaptive Filters 6
lcntr=TAPS-1, do update_zn until lce;
f12=f2*f4, f0=f8-f12, f8=dm(i1,m0);
lcntr=TAPS-2, do update_zrow until lce;
f3=f0*f5, f0=f8-f12, f8=dm(i1,m0), f4=pm(i9,m9);
update_zrow: f12=f2*f4, dm(i3,m0)=f3;
f3=f0*f5, f0=f8-f12, f2=dm(i2,m0);
f0=f0*f5, dm(i3,m0)=f3, f4=pm(i9,m9);
f12=f2*f4, dm(i3,m0)=f0, f4=pm(i9,m9);
update_zn: f8=dm(i1,m0);
f12=f2*f4, f0=f8-f12, f8=dm(i1,m0);
lcntr=TAPS-2, do update_zlastrow until lce;
f3=f0*f5, f0=f8-f12, f8=dm(i1,m0), f4=pm(i9,m9);
update_zlastrow: f12=f2*f4, dm(i3,m0)=f3;
f3=f0*f5, f0=f8-f12, dm(i0,m0)=f11;
rts(db);
f0=f0*f5, dm(i3,m0)=f3;
dm(i3,m0)=f0;
.ENDSEG;
File Name
TESTAFA.ASM
Version
April 2 1991
Purpose
This is a testing shell for the Adaptive Filtering Algorithms.
(listing continues on next page)
199
6 Adaptive Filters
This file has to be edited to conform with the algorithm employed.
Equations Implemented
***********************************************************************
* This program generates a moving average sequence of real
samples, *
* and employs the adaptive filter for System Identification based
on *
* a transversal FIR filter structure. The generating plant is
*
* described by the following difference equation: *
* y(n)= x(n-1)-0.5x(n-2)-x(n-3)
*
* where the plant impulse response is 0, 1, -0.5, -1, 0, 0, ... .
*
* The filter is allowed five weight coefficients. The input data
*
* sequence is a pseudonoise process with a period of 20.
*
* This program is also used to test adaptive filters based on
both *
* symmetric transversal FIR and lattice FIR structures. The output
*
* filter error signal is stored in a data buffer named [flt_err]
*
* for comparison.
*
* Place * s in this file with the corresponding code
*
***********************************************************************
*********************************************************************************/
#define SAMPLES **
/* ** = 200 for RLS and 1000 for various LMS */
.SEGMENT/DM dm_data;
.VAR flt_err[SAMPLES];
.VAR input_data[20]= 0.038, -0.901, 0.01, -0.125, -1.275, 0.877,
200
Adaptive Filters 6
-0.881,
0.930, 1.233, -1.022, 1.522, -0.170, 1.489, -1.469,
1.068, -0.258, 0.989, -2.891, -0.841, -0.355;
.GLOBAL flt_err;
.ENDSEG;
.SEGMENT/PM pm_data;
.VAR plant_hn[3]= -1.0, -0.5, 1.0;
.ENDSEG;
.SEGMENT/PM rst_svc;
dmwait=0x21;
pmwait=0x21;
jump begin;
.ENDSEG;
.SEGMENT/PM pm_code;
begin: b6=flt_err;
l6=0;
b7=input_data;
201
6 Adaptive Filters
l7=20;
b15=plant_hn;
l15=3;
m15=1;
m7=1;
m6=-3;
202
Adaptive Filters 6
6.3 CONCLUSION
Table 6.1 lists the number of instruction cycles and Memory Usage usage
the memory
relating to the various LMS algorithms implemented with a transversal
FIR filter structure. It is interesting to note that the sign-LMS algorithms,
originally designed to ease implementation in fixed-function silicon,
require more instruction cycles in a programmable DSP due to their sign
checking routines.
Memory Usage
Algorithm Cycles per PM Code PM Data DM Data
Iteration
LMS 3N + 8 29 N N
“Leaky” LMS 3N + 8 30 N N
Normalized LMS 3N + 16 40 N N
Sign-Error LMS 3N + 8 29 N N
Sign-Data LMS 4N + 8 32 N N
Sign-Sign LMS 4N + 8 31 N N
Table 6.1 Transversal FIR LMS Performance & Memory Benchmarks For Filters Of Order
N
Table 6.2 shows the performance of the LMS algorithm implemented with
three different FIR filter structures. The final application will determine
the structure used.
Memory Usage
Structure Cycles per PM Code PM Data DM Data
Iteration
Transversal 3N + 8 29 N N
Symmetric Transversal 2N + 10 39 0.5N N
Lattice-Ladder 11N + 7 36 2N + 1 N+1
This chapter gives an overview of the DFT algorithm and discusses the
features of the ADSP-21000 family architecture that enable fast execution
of FFTs. It presents implementations of the DFT when its size is a power of
two (a radix-2 FFT) and when its size is a power of four (a radix-4 FFT).
Details of these implementations are discussed, including optimization, bit
and digit-reversal, coefficient generation, and inverse transforms. This
chapter also provides benchmarks and code listings of implementations of
the algorithms.
205
7 Fourier Transforms
7.1 COMPUTATION OF THE DFT
The DFT resembles the correlation operation in that it measures the
similarity of an unknown signal with a complex exponential. The resulting
spectrum yields the complex information (phase and amplitude) for N
frequencies. The resulting values are commonly called frequency bins
because they fill up with amplitude information for each frequency.
N-1
206
Fourier Transforms 7
7.1.1 Derivation Of The Fast Fourier Transform
The basis of the FFT is that a DFT can be divided into smaller DFTs. A
radix-2 FFT divides the FFT DFT into two smaller DFTs, each of which is
divided into two smaller DFTs, and so on, resulting in a combination of
two-point DFTs.
In a similar fashion, a radix-4 FFT divides the DFT into four smaller DFTs,
each of which is divided into four smaller DFTs, and so on, resulting in a
combination of four-point DFTs.
Two methods are used repeatedly to split the DFTs into smaller (two-
point or four-point) core calculations:
See the references listed at the end of this chapter for more detail on the
derivation of these and other FFT algorithms.
207
7 Fourier Transforms
7.1.2 Butterfly Calculations
The two-point DFT at the core of a radix-2 DIT FFT is called a butterfly
calculation. The equations for the radix-2 DIT butterfly are:
208
Fourier Transforms 7
Stage 1 Stage 2 Stage 3 Stage 4 Stage 5
x(0) X(0)
x(16) X(1)
W0 -1
x(8) X(2)
W0 -1
x(24) X(3)
W0 -1 W8 -1
x(4) X(4)
W0 -1
x(20) X(5)
W0 -1 W4 -1
x(12) X(6)
W0 -1 W8 -1
x(28) X(7)
W0 -1 W8 -1 W12 -1 -1
x(2) X(8)
W0 -1
x(18) X(9)
W0 -1 W2 -1
x(10) X(10)
W0 -1 W4 -1
x(26) X(11)
W0 -1 W8 -1 W6
x(6) X(12)
W0 -1 W8 -1
x(22) X(13)
W0 -1 W4 -1 W10 -1
x(14) X(14)
W0 -1 W8 -1 W12 -1
x(30) X(15)
W0 -1 W8 -1 W12 -1 W14 -1
x(1) X(16)
W0 -1
x(17) X(17)
W0 -1 W1 -1
x(9) X(18)
W0 -1 W2 -1
x(25) X(19)
W0 -1 W8 -1 W3 -1
x(5) X(20)
W0 -1 W4 -1
x(21) X(21)
W0 -1 W4 -1 W5 -1
x(13) X(22)
W0 -1 W8 -1 W6 -1
x(29) X(23)
W0 -1 W8 -1 W12 -1 -1 W7 -1
x(3) X(24)
W0 -1 W8 -1
x(19) X(25)
W0 -1 W2 -1 W9 -1
x(11) X(26)
W0 -1 W4 -1 W10 -1
x(27) X(27)
W0 -1 W8 -1 W6 W11 -1
x(7) X(28)
W0 -1 W8 -1 W12 -1
x(23) X(29)
W0 -1 W4 -1 W10 -1 W13 -1
x(15) X(30)
W0 -1 W8 -1 W12 -1 W14 -1
x(31) X(31)
W0 -1 W8 -1 W12 -1 W14 -1 W15 -1
210
Fourier Transforms 7
The Radix-4 FFT algorithm, in particular, requires that the computation
units keep many intermediate values. The ADSP-210xx’s 16-location
register file prevents bottlenecks from forming between the computation
units and memory. It can store all intermediate values needed for an
efficient 14 cycle radix-4 butterfly.
Listing 7.1 shows an example architecture file for the ADSP-21020 which
supports bit-reversal. It assigns the segments dm_rdat and dm_idat
to support bit-reversal of the real and imaginary data for FFTs up to 16K
points long.
211
7 Fourier Transforms
The radix-2 program in Listing 7.2 reads the input arrays, redata and
imdata , using bit-reversed addressing. The .SEGMENT directive is used
to place these arrays at absolute locations in the dm_rdat and dm_idat
segments.
The radix-4 program in Listing 7.3 writes to the output arrays, refft
and imfft , using bit-reversed addressing. The .SEGMENT directive is
used to place these arrays at absolute locations in the dm_rdat and
dm_idat segments.
The first two stages are performed together in a single loop. They are
combined into a loop that implements what is essentially a radix-4
butterfly. The twiddle factor for this butterfly is WN0 = cos(0) – j sin(0) = 1;
therefore no multiplications are required. The first two stages execute
quickly because there is no group overhead and because the no-multiply
radix-4 butterfly requires an equivalent of two cycles per radix-2 butterfly
compared to the normal four cycles required by the middle stages.
The addressing of data input to all butterflies is from the bottom of the
working array to the top, in other words, they are accessed backwards.
The flow graph for the FFT is structured so that all butterflies within a
particular group require the same twiddle factors. As a result, a new value
must be read into the register file only at the start of each new group,
which improves the FFT’s performance by about 20%.
Most of the FFT execution time is spent within the middle stage
butterflies. The ADSP-210xx architecture executes these butterflies very
efficiently. They require only four cycles each to execute, because a
multiply and add operation is performed in all four cycles and a subtract
is done on two of the four cycles. The two address generators support four
memory reads and four memory writes in each butterfly with a total of
eight address pointer updates.
212
Fourier Transforms 7
The second to the last stage has only two butterflies per group. It is
implemented with one loop to reduce the overhead between groups. Each
butterfly takes only 4.5 cycles in this stage.
The last stage has only one butterfly per group, is implemented within one
loop, and takes five cycles per butterfly.
The radix-4 program is broken up into a minimum of three stages: the first
stage, the middle stages, and the last stage. Since there is a minimum of
three stages, the smallest FFT calculable in this implementation is
N=43=64 points.
213
7 Fourier Transforms
The following equations are used in the ADSP-210xx implementation of
the radix-4 DIF butterfly:
See the references at the end of this chapter for more detail on radix-4 FFT
structures.
The radix-2 FFT requires two twiddle factor tables stored in memory.
These tables consist of 1/2 cycle of a sine wave and a cosine wave. Each
table is N/2 samples long. A C language utility for generating the two
tables, called TWIDRAD2.C , is shown in Listing 7.4.
The radix-4 FFT also requires two twiddle factor tables stored in memory.
They are different from the radix-2 tables in that they are stored in a bit-
reversed-like order. These tables consist of 3N/4 samples. A C language
utility for generating the two tables called twidrad4.c , is shown in
Listing 7.5. Radix-4 tables generated for an N-point FFT also can be used
for any radix-4 FFT of length less than N, because of the bit-reversed
nature of the radix-4 tables.
214
Fourier Transforms 7
7.4 INVERSE COMPLEX FFTs
The inverse relationship for obtaining a sequence from its DFT is called
the inverse DFT (IDFT). The transformation is described by the equation
N-1
x(n) = 1/N ∑ X(k) ej2πnk/N n = 0 to N – 1
k=0
You can use the complex FFTs algorithms described in this chapter to
implement both the IDFT and the DFT. The only difference between the
two transformations is the normalization factor 1/N and the phase signs of
the twiddle factors. Consequently, an FFT algorithm for computing the
DFT are converted into an IFFT algorithm by using a reversed (upside
down) twiddle factor table. Because the cosine table is symmetrical about
zero, only the sine table must be reversed.
The data swapping steps need no overhead if they are incorporated into
data transfers that are already required.
215
7 Fourier Transforms
7.5 BENCHMARKS
Complex Radix-2 FFT with Bit -Reversal
Memory Usage:
pm code = 158 words, pm data = 1.5 * N words, dm data = 3.5 * N
words
Memory Usage:
pm code = 192 words, pm data = 1.75 * N words, dm data = 3.75 * N
words
216
Fourier Transforms 7
7.6 CODE LISTINGS
7.6.1 FFT.ACH - Architecture File
! FFT.ACH
! Example (ADSP-21020) architecture file for FFTs. Supports proper multiple
! of N base addresses for bit reversing of real and imaginary arrays of sizes
! up to 16K words.
.SYSTEM fft;
.PROCESSOR = ADSP21020;
.ENDSYS;
217
7 Fourier Transforms
7.6.2 FFTRAD2.ASM - Complex Radix2 FFT
/*___________________________________________________________________________
FFTRAD2.ASM ADSP-21020 Radix-2 DIT Complex Fast Fourier Transform
Calculates a radix-2 FFT. The FFT length (N) must be a power of 2 and a
minimum of 32 points. Input data is not destroyed during the course of this
routine. The input and output arrays are normal ordered. The real array is
stored in DM, the imaginary array is stored in PM. The real twiddle factors
are in an N/2 long Cosine table stored in PM, and the imaginary twiddle
factors are in an N/2 long Sine Table in stored in DM. The twiddle factors
are generated by the program TWIDRAD2.
To implement a inverse FFT, one only has to (1) swap the real and imaginary
of the incoming data, (2) take the forward FFT, (3) swap the real and
imaginary of the outgoing data, and (4) scale the data by 1/N.
Version:
10-SEP-90 Original
25-APR-91 Revision
26-MAY-93, in FSTAGE drain pipe without dummy dm access
14-DEC-93, Cleaned up format, added benchmarks
Calling Parameters:
pm(cosine[N/2]) - real twiddle factors from TWIDRAD2 program
dm(sine[N/2]) - imaginary twiddle factors from TWIDRAD2 program
dm(redata[N]) - real input array, bitreversed to a working array
dm(imdata[N]) - imaginary input array, bitreversed to a working array
(Note: Because the bit reversed address mode is used with the arrays
refft and imfft, they must start at addresses that are integer
multiples of the length (N) of the transform, (i.e. 0,N,2N,3N,...).
This is accomplished by specifing two segments starting at those addresses
in the architecture file and placing the variables alone in their
respective segments. These addresses must also be reflected in the
preprocessor variables ORE and OIM in bit reversed format.)
Return Values:
dm(refft[N]) - real working array and output
dm(imfft[N]) - imaginary working array and output
Altered Registers:
Most I, M, L, and R registers.
Three levels of loop nesting.
218
Fourier Transforms 7
Benchmarks: Radix-2, complex with bit reversal
Memory Usage:
pm code = 158 words, pm data = 1.5*N words, dm data = 3.5*N words
____________________________________________________________________________*/
.SEGMENT/DM dm_data;
.VAR sine[N/2]= “ts2.dat”; /*imag twiddle factors, from TWIDRAD2 */
.VAR refft[N]; /* real result */
.GLOBAL refft;
.ENDSEG;
219
7 Fourier Transforms
.SEGMENT/DM dm_idat; /* This segment is an integer multiple of N */
.VAR imdata[N]= “inimag.dat”; /* input image array */
.GLOBAL imdata;
.ENDSEG;
.SEGMENT/PM pm_data;
.VAR cosine[N/2]= “tc2.dat”; /* real twiddle factors, from TWIDRAD2 */
.VAR imfft[N]; /* imag result */
.GLOBAL imfft;
.ENDSEG;
.SEGMENT/PM pm_code;
/*_______________________________begin FFT___________________________________*/
fftrad2:
bit set mode1 BR0; /* enable bit reverse of i0 */
B0=OIM; /* Points to input imaginary array */
l0=0;
m0=BRMODIFY; /* Modifier for bitreverse counter*/
b8=imfft;
l8=N; /* Circ pointer limits loopend pointer overflow */
b10=imfft;
l10=N; /* Circ pointer limits loopend pointer overflow */
220
Fourier Transforms 7
/*Do the first two stages (actually a radix-4 FFT stage)*/
f0=dm(i0,m0), f1=pm(i8,m8);
f2=dm(i0,m0), f3=pm(i8,m8);
f0=f0+f2, f2=f0-f2, f4=dm(i0,m0), f5=pm(i8,m8);
f1=f1+f3, f3=f1-f3, f6=dm(i0,m0), f7=pm(i8,m8);
f4=f6+f4, f6=f6-f4;
f5=f5+f7, f7=f5-f7;
f8=f0+f4, f9=f0-f4;
f10=f1+f5, f11=f1-f5;
b0=refft;
l0=N; /* Circ pointer limits loopend pointer overflow */
b1=sine;
l1=@sine;
b9=cosine;
l9=@cosine;
b11=imfft;
l11=N; /* Circ pointer limits loopend pointer overflow */
m0=-BFLY8;
m1=-N/8;
m2=-BFLY8-1;
m9=-N/8;
m11=-1;
r2=2;
r3=-BFLY8; /*initializes m0,10 - incr for butterf branches*/
r5=BFLY8; /*counts # butterflies per a group */
r9=(-2*BFLY8)-1; /*initializes m12 - wrap around to next grp + 1*/
r10=-2*BFLY8; /*initializes m8 - incr between groups */
r13=-BFLY8-1; /*initializes m2,13 - wrap to bgn of 1st group */
r15=N/8; /*# OF GROUPS IN THIRD STAGE*/
f0=dm(i1,m1), f7=pm(i8,m8);
f12=f0*f7, f6=dm(i0,m0), f1=pm(i9,m9);
f8=f1*f6, modify(i11,m10);
f11=f1*f7, f7=pm(i8,m8);
f14=f0*f6, f12=f8+f12, f8=dm(i0,m0);
f12=f0*f7, f13=f8+f12, f10=f8-f12, f6=dm(i0,m0);
222
Fourier Transforms 7
r4=r15+r2, i1=b1; /*PREPARE R4 FOR #OF BFLIES CALC*/
r15=ashift r4 by -1; /*# OF BFLIES/GRP IN NEXT STAGE*/
r4=-r15, i9=b9;
m1=r4; /*update inc for sin & cos */
m9=r4;
r5=ashift r5 by 1, f1=dm(i1,m1); /*update # bttrfly in a grp*/
r3=-r5; /* inc for bttrfly branch*/
r13=r3-1, m0=r3; /* wrap to 1st grp */
r10=ashift r3 by 1, f7=pm(i9,m9); /* inc between grps */
end_stage: r9=r10-1, m2=r13; /* wrap to grp +1 */
i0=refft+N-1;
i1=sine+(N/2)-2; /*pntr to 1st sine coeff */
i2=refft+N-1;
i8=imfft+N-1;
i9=cosine+(N/2)-2; /*pntr to 1st cosine coeff */
i10=imfft+N-1;
i11=imfft+N-1;
f0=dm(i1,m1), f7=pm(i8,m8);
f12=f0*f7, f6=dm(i0,m0), f1=pm(i9,m9);
f8=f1*f6, modify(i11,m10);
f11=f1*f7, f7=pm(i8,m12);
f14=f0*f6, f12=f8+f12, f8=dm(i0,m0);
f12=f0*f7, f13=f8+f12, f10=f8-f12, f6=dm(i0,m0);
223
7 Fourier Transforms
/* The last stage */
m0=-N/2;
m2=-N/2-1;
m10=m0;
m13=m2;
i0=refft+N-1;
i1=sine+(N/2)-1; /*pntr to 1st sine coeff */
i2=refft+N-1;
i8=imfft+N-1;
i9=cosine+(N/2)-1; /*pntr to 1st cosine coeff */
i10=imfft+N-1;
i11=imfft+N-1;
m1=-1; /*modifiers to coeff tables */
m9=-1;
/*do N/2 bttrflys in the last stage, one butterfly per group*/
rts; /*finished*/
/*_______________________________________________________________________*/
.ENDSEG;
224
Fourier Transforms 7
7.6.3 FFTRAD4.ASM - Complex Radix-4 FFT
/*_____________________________________________________________________________
FFTRAD4.ASM ADSP-21020 Radix-4 Complex Fast Fourier Transform
This routine performs a complex, radix 4 Fast Fourier Transform (FFT). The FFT
length (N) must be a power of 4 and a minimum of 64 points. The real part of
the input data is placed in DM and the complex part in PM. This data is
destroyed during the course of the computation. The real and complex output of
the FFT is placed in separate locations in DM.
Since this routine takes care of all necessary address digit-reversals, the
input and output data are in normal order. The digit reversal is accomplished
by using a modified radix 4 butterfly throughout which swaps the inner two
nodes resulting with bit reversed data. The digit reversal is completed by
bit reversing the real data in the final stage and then bit reversing the
imaginary so that it ends up in DM.
To implement an inverse FFT, you only have to (1) swap the incoming datas real
and imaginary parts, (2) run the forward FFT, (3) swap the outgoing datas
real and imaginary parts and (4) scale the data by 1/N.
For this routine to work correctly, the program “twidrad4.C” must be used to
generate the special twiddle factor tables for this program.
Calling Parameters:
costwid table at DM : cosine length 3*N/4
sintwid table at PM : sine length 3*N/4
real input at DM : redata length N, normal order
imag input at PM : imdata length N, normal order
Return Values:
real output at DM : refft length N, normal order
imag output at DM : imfft length N, normal order
(Note: Because the bit reversed addressing mode is used with the arrays
refft and imfft, they must start at addresses that are integer
multiples of the length (N) of the transform, (i.e. 0,N,2N,3N,...).
This is accomplished by specifying two segments starting at those addresses
in the architecture file and placing the variables alone in their
respective segments. These addresses must also be reflected in the
preprocessor variables ORE and OIM in bit reversed format.)
Altered Registers:
All I, M, L and R registers.
Three levels of looping.
225
7 Fourier Transforms
Benchmarks: radix-4, complex with digit reversal
Memory Usage:
pm code = 192 words, pm data = 1.75*N words, dm data = 3.75*N words
_____________________________________________________________________________*/
.SEGMENT/DM dm_data;
.VAR cosine[3*N/4]=”tc4.dat”;/*Cosine twiddle factors, from FFTTR4TBL prog*/
.VAR redata[N]=”inreal.dat”; /* Input real data */
.GLOBAL redata;
.ENDSEG;
.SEGMENT/PM pm_data;
.VAR sine[3*N/4]=”ts4.dat”; /* Sine twiddle factors, from FFTTR4TBL prog*/
.VAR imdata[N]=”inimag.dat”; /* Input imaginary data */
.GLOBAL imdata;
.ENDSEG;
226
Fourier Transforms 7
.SEGMENT/PM rst_svc; /* program starts at the reset vector */
pmwait=0x0021; /*pgsz=0,pmwtstates=0,intrn.wtstates only*/
dmwait=0x008421; /*pgsz=0,dmwtstates=0,intrn.wtstates only*/
call fft;
stop: idle;
.ENDSEG;
.SEGMENT/PM pm_code;
fft:
/*_______first stage radix-4 butterfly without twiddles______*/
i0=redata;
i1=redata+N/4;
i2=redata+N/2;
i3=redata+3*N/4;
i4=i0;
i5=i1;
i6=i2;
i7=i3;
m0=1;
m8=1;
i8=imdata;
i9=imdata+N/4;
i10=imdata+N/2;
i11=imdata+3*N/4;
i12=i8;
i13=i9;
i14=i10;
i15=i11;
l0 = 0;
l1 = l0;
l2 = l0;
l3 = N; /* circular to prevent pointer overflow */
l4 = l0;
l5 = l0;
l6 = l0;
l7 = l0;
l8 = l0;
l9 = l0;
l10 = l0;
l11 = N; /* circular to prevent pointer overflow */
l12 = l0;
l13 = l0;
l14 = l0;
l15 = l0;
f0=dm(i0,m0), f1=pm(i8,m8);
f2=dm(i2,m0), f3=pm(i10,m8);
f0=f0+f2, f2=f0-f2, f4=dm(i1,m0), f5=pm(i9,m8);
f1=f1+f3, f3=f1-f3, f6=dm(i3,m0), f7=pm(i11,m8);
f4=f6+f4, f6=f6-f4;
f5=f5+f7, f7=f5-f7;
f8=f0+f4, f9=f0-f4;
f10=f1+f5, f11=f1-f5;
227
7 Fourier Transforms
lcntr=N/4, do fstage until lce; /* do N/4 simple radix-4 butterflies */
f12=f2+f7, f13=f2-f7, f0=dm(i0,m0), f1=pm(i8,m8);
f14=f3+f6, f15=f3-f6, f2=dm(i2,m0), f3=pm(i10,m8);
f0=f0+f2, f2=f0-f2, f4=dm(i1,m0), f5=pm(i9,m8);
f1=f1+f3, f3=f1-f3, f6=dm(i3,m0), f7=pm(i11,m8);
f4=f6+f4, f6=f6-f4, dm(i4,m0)=f8, pm(i12,m8)=f10;
f5=f5+f7, f7=f5-f7, dm(i5,m0)=f9, pm(i13,m8)=f11;
f8=f0+f4, f9=f0-f4, dm(i6,m0)=f12, pm(i14,m8)=f14;
fstage:
f10=f1+f5, f11=f1-f5, dm(i7,m0)=f13, pm(i15,m8)=f15;
r8=redata;
r9=imdata;
228
Fourier Transforms 7
lcntr=m5, do mgroup until lce; /* do m5 groups */
f0=dm(i7,m0), f5=pm(i9,m8);
f8=f0*f5, f4=dm(i1,m0), f1=pm(i15,m8);
f9=f0*f4;
f12=f1*f5, f0=dm(i7,m0), f5=pm(i11,m8);
f13=f1*f4, f12=f9+f12, f4=dm(i3,m0), f1=pm(i15,m8);
f8=f0*f4, f2=f8-f13;
f13=f1*f5;
f9=f0*f5, f8=f8+f13, f0=dm(i7,m1), f5=pm(i10,m8);
f13=f1*f4, f12=f8+f12, f14=f8-f12, f4=dm(i2,m0), f1=pm(i15,m9);
f11=f0*f4;
f13=f1*f5, f6=f9-f13;
f9=f0*f5, f13=f11+f13, f11=dm(i0,0);
f13=f1*f4, f8=f11+f13, f10=f11-f13;
/*___________Do m7 radix-4 butterflies___________*/
lcntr=m7, do mr4bfly until lce;
f13=f9-f13, f4=dm(i1,m0), f5=pm(i9,m8);
f2=f2+f6, f15=f2-f6, f0=dm(i7,m0), f1=pm(i15,m8);
f8=f0*f4, f3=f8+f12, f7=f8-f12, f9=pm(i8,0);
f12=f1*f5, f9=f9+f13, f11=f9-f13, f13=f2;
f8=f0*f5, f12=f8+f12, f0=dm(i7,m0), f5=pm(i11,m8);
f13=f1*f4, f9=f9+f13, f6=f9-f13, f4=dm(i3,m0), f1=pm(i15,m8);
f8=f0*f4, f2=f8-f13, dm(i0,m0)=f3, pm(i8,m8)=f9;
f13=f1*f5, f11=f11+f14, f7=f11-f14, dm(i4,m0)=f7, pm(i12,m8)=f6;
f9=f0*f5, f8=f8+f13, f0=dm(i7,m1), f5=pm(i10,m8);
f13=f1*f4, f12=f8+f12, f14=f8-f12, f4=dm(i2,m0), f1=pm(i15,m9);
f11=f0*f4, f3=f10+f15, f8=f10-f15, pm(i13,m8)=f11;
f13=f1*f5, f6=f9-f13, dm(i6,m0)=f8, pm(i14,m8)=f7;
f9=f0*f5, f13=f11+f13, f11=dm(i0,0);
mr4bfly:
f13=f1*f4, f8=f11+f13, f10=f11-f13, dm(i5,m0)=f3;
/*___________End radix-4 butterfly_____________*/
/* dummy for address update * * */
f13=f9-f13, f0=dm(i7,m2), f1=pm(i15,m10);
f2=f2+f6, f15=f2-f6, f0=dm(i1,m4), f1=pm(i9,m12);
f3=f8+f12, f7=f8-f12, f9=pm(i8,0);
f9=f9+f13, f11=f9-f13, f0=dm(i2,m4);
f9=f9+f2, f6=f9-f2, f0=dm(i3,m4), f1=pm(i10,m12);
dm(i0,m3)=f3, pm(i8,m11)=f9;
f11=f11+f14, f7=f11-f14, dm(i4,m3)=f7, pm(i12,m11)=f6;
f3=f10+f15, f8=f10-f15, pm(i13,m11)=f11;
dm(i6,m3)=f8, pm(i14,m11)=f7;
mgroup: dm(i5,m3)=f3, f1=pm(i11,m12);
r3=m4;
r1=m5;
r2=m6;
r3=ashift r3 by -2; /* groupstep/4 */
r1=ashift r1 by 2; /* groups*4 */
m5=r1;
mstage: r2=ashift r2 by -2; /* butterflies/4 */
229
7 Fourier Transforms
bit set mode1 BR0; /* bitreversal in i0 */
/* with: m0=m8=1 preset */
i4=redata; /* input */
i1=redata+1;
i2=redata+2;
i3=redata+3;
i0=ORE; /* real output array base must be an integer multiple of N */
m2=OST;
i7=cosine;
i8=imdata; /* input */
i9=imdata+1;
i10=imdata+2;
i11=imdata+3;
i12=imdata; /* output */
i15=sine;
m1=4;
m9=m1;
f0=dm(i7,m0), f5=pm(i9,m9);
f8=f0*f5, f4=dm(i1,m1), f1=pm(i15,m8);
f9=f0*f4;
f12=f1*f5, f0=dm(i7,m0), f5=pm(i11,m9);
f13=f1*f4, f12=f9+f12, f4=dm(i3,m1), f1=pm(i15,m8);
f8=f0*f4, f2=f8-f13;
f13=f1*f5;
f9=f0*f5, f8=f8+f13, f0=dm(i7,m0), f5=pm(i10,m9);
f13=f1*f4, f12=f8+f12, f14=f8-f12, f4=dm(i2,m1), f1=pm(i15,m8);
f11=f0*f4;
f13=f1*f5, f6=f9-f13;
f9=f0*f5, f13=f11+f13, f11=dm(i4,m1);
f13=f1*f4, f8=f11+f13, f10=f11-f13;
/*________Do N/4-1 radix-4 butterflies_______*/
lcntr=N/4-1, do lstage until lce;
f13=f9-f13, f4=dm(i1,m1), f5=pm(i9,m9);
f2=f2+f6, f15=f2-f6, f0=dm(i7,m0), f1=pm(i15,m8);
f8=f0*f4, f3=f8+f12, f7=f8-f12, f9=pm(i8,m9);
f12=f1*f5, f9=f9+f13, f11=f9-f13, f13=f2;
f8=f0*f5, f12=f8+f12, f0=dm(i7,m0), f5=pm(i11,m9);
f13=f1*f4, f9=f9+f13, f6=f9-f13, f4=dm(i3,m1), f1=pm(i15,m8);
f8=f0*f4, f2=f8-f13, dm(i0,m2)=f3, pm(i12,m8)=f9;
f13=f1*f5, f11=f11+f14, f7=f11-f14, dm(i0,m2)=f7, pm(i12,m8)=f6;
f9=f0*f5, f8=f8+f13, f0=dm(i7,m0), f5=pm(i10,m9);
f13=f1*f4, f12=f8+f12, f14=f8-f12, f4=dm(i2,m1), f1=pm(i15,m8);
f11=f0*f4, f3=f10+f15, f8=f10-f15, pm(i12,m8)=f11;
f13=f1*f5, f6=f9-f13, dm(i0,m2)=f3, pm(i12,m8)=f7;
f9=f0*f5, f13=f11+f13, f11=dm(i4,m1);
lstage:
230
Fourier Transforms 7
f13=f1*f4, f8=f11+f13, f10=f11-f13, dm(i0,m2)=f8;
f13=f9-f13;
f2=f2+f6, f15=f2-f6;
f3=f8+f12, f7=f8-f12, f9=pm(i8,m9);
f9=f9+f13, f11=f9-f13, dm(i0,m2)=f3;
f9=f9+f2, f6=f9-f2, dm(i0,m2)=f7;
pm(i12,m8)=f9;
f11=f11+f14, f7=f11-f14, pm(i12,m8)=f6;
f3=f10+f15, f8=f10-f15, pm(i12,m8)=f11;
dm(i0,m2)=f3, pm(i12,m8)=f7;
dm(i0,m2)=f8;
rts (db);
dm(i0,m2)=f0;
bit clr mode1 BR0; /* no bitreversal in i0 any more */
.ENDSEG;
231
7 Fourier Transforms
7.6.4 TWIDRAD2.C - Radix2 Coefficient Generator
/*______________________________________________________________________
TWIDRAD2.C Twiddle Factor Generator for ADSP-21020 Radix-2 FFT
#include <stdio.h>
#include <math.h>
main()
{
int i, n;
double freq, c, s;
double pi;
FILE *s_file;
FILE *c_file;
char filename1[25];
char filename2[25];
/* initialize pi */
pi = 4.0*atan(1.0);
232
Fourier Transforms 7
printf(“Enter the number of points -> “);
scanf(“ %d”,&n);
n=n/2;
freq=2.0*pi*0.5/(double)n;
for (i=0; i <= n-1; i++)
{
s=sin((double)i * freq);
c=cos((double)i * freq);
fprintf(s_file,”%22.14e\n”,s);
fprintf(c_file,”%22.14e\n”,c);
printf(“%d : %22.14e : %22.14e\r”,i,s,c);
}
fclose(s_file);
fclose(c_file);
printf(“\nFinished\n”);
}
233
7 Fourier Transforms
7.6.5 TWIDRAD4.C - Radix4 Coefficient Generator
/*__________________________________________________________________________
TWIDRAD4.C
C program to generate the radix-4 twiddle factor table for the
fast FFT-program RAD4.ASM for the ADSP21020.
#include <stdio.h>
#include <math.h>
#define MAX_LEN 16384
int b[MAX_LEN];
main() {
double tc;
double ts;
double k1, k2, k3; /* coeff for table calculation */
double pi;
char cosfn[20]; /* strings for file names */
char sinfn[20];
int length, /* length of FFT */
iter, /* length/4 */
i, j, k;
FILE *s_file, *c_file;
/* initialize pi */
pi=4.0*atan(1.0);
/* User interface */
234
Fourier Transforms 7
printf(“ Name of Sine table to be created: “);
scanf(“ %s”,sinfn);
s_file=fopen(sinfn,”w”);
/* Start Calculations */
if (length > 1024)
printf(“\n\n Thinking hard . . .”);
else
printf(“\n\n Thinking . . .”);
iter = length/4;
/* calculate tables */
k1 = (2*pi/(double)length);
k2 = (2*pi/(double)length)*2;
k3 = (2*pi/(double)length)*3;
for (i=0;i<iter;i++) {
tc = cos(b[i]*k1);
ts = sin(b[i]*k1);
fprintf(c_file,”%22.14e\n”,tc);
fprintf(s_file,”%22.14e\n”,ts);
tc = cos(b[i]*k3);
ts = sin(b[i]*k3);
fprintf(c_file,”%22.14e\n”,tc);
fprintf(s_file,”%22.14e\n”,ts);
tc = cos(b[i]*k2);
ts = sin(b[i]*k2);
fprintf(c_file,”%22.14e\n”,tc);
fprintf(s_file,”%22.14e\n”,ts);
}
fclose(c_file);
fclose(s_file);
printf(“\n Done!\n\n”);
}
235
7.7 REFERENCES
[ADI92] Analog Devices Inc., A. Mar, ed. 1992. Digital Signal
Processing Applications Using The ADSP-2100 Family,
Englewood Cliffs, NJ: Prentice-Hall, Inc.
237
8 Graphics
The subroutine calculates the outcodes for both of the endpoints of each
line. The outcode is a six-bit value that represents this truth table:
The algorithm performs these six tests and sets the outcode bits
accordingly. If an outcode is zero, the point lies within the view volume.
trivial acceptance both endpoints inside view volume, the line can be
simply rendered
trivial rejection both endpoints outside the view volume, the line is
not rendered at all
See [FOLEY90] p. 145-149 for further information about 3-D Graphics Line
Accept/Reject.
238
Graphics 8
8.1.1 Implementation
Each point is subjected to six test to determine its outcode. After the six
compares are performed on each point, the Compare Accumulator (upper
six bits of ASTAT) have these values:
A point is only in the view volume if all six of these bits are zero.
Both outcodes must be all zeroes for the line to be trivially accepted.
If the logical OR of the outcodes is not equal to all zeroes, the line can be
trivially rejected.
When the line is neither trivially accepted nor rejected, it is clipped. There
will be at least one bit and at most three bits in the outcode which are non-
zero. Each non-zero bit requires the line to be clipped to its corresponding
window coordinate as shown Table 10.1 above. For example, if outcode bit
2 is non-zero, the line must be clipped to the Ymin boundary.
After each line is classified, and possibly clipped, a value is added to the
display list. A zero (0) in the display list represents a trivially rejected line,
a one (1) represents a trivially accepted line, and a two (2) indicates the
line required clipping. For lines that required clipping, the clipped
endpoint values (two points, i.e. six values) are written to the clipped
endpoint list.
239
8 Graphics
8.1.2 Code Listing
/
*********************************************************************************
File Name
accrej.asm:
Version
1.0
Purpose
Performs trivial accept/reject checking on lines.
Equations Implemented
Calling Parameters
REGISTER FILE
r7 number of lines to be examined
r12 preload with 0x3F (six ones)
r13 preload with 2
r14 preload with 1
r15 preload with 0
Return Values
Registers Affected
r0-r9
Cycle Count
25*N + 28 worst case (all lines require clipping, 24 cache misses),
21*N + 1 best case (all lines trivially accepted, PMDAs in cache)
where N = number of lines, and worst case does NOT include time to perform
clipping
# PM Locations
45 instructions + 6 data
# DM Locations
10*N, where N = Number of lines to be examined.
*********************************************************************************/
240
Graphics 8
.EXTERN clipline;
.EXTERN accrej;
/* static registers */
#define REJECT r15 /* = 0 */
#define ACCEPT r14 /* = 1 */
#define CLIP r13 /* = 2 */
#define SIX_ONES r12 /* = 0x3f */
/* indices to lists */
#define DSP_LIST i1
#define CLIP_LIST i2
241
8 Graphics
/* START OF LOOP */
lcntr = r7, do arlp-1 until lce;
/* Perform OUTCODE #2 Comparisons */
comp (f8,X2), f8=pm(i8,m8); /* X2 < Xmin ? */
comp (X2,f8), Y2=dm(i0,m0), f8=pm(i8,m8); /* X2 > Xmax ? */
comp (f8,Y2), f8=pm(i8,m8); /* Y2 < Ymin ? */
comp (Y2,f8), Z2=dm(i0,m0), f8=pm(i8,m8); /* Y2 > Ymax ? */
comp (f8,Z2), f8=pm(i8,m8); /* Z2 < Zmin ? */
comp (Z2,f8); /* Z2 > Zmax ? */
clipit:
if ne call clipline;
OC1 = astat;
OC1 = fext OC1 by 26:6;
242
Graphics 8
comp (f8,X2), f8=pm(i8,m8); /* X2 < Xmin ? */
comp (X2,f8), Y2=dm(i0,m0), f8=pm(i8,m8); /* X2 > Xmax ? */
comp (f8,Y2), f8=pm(i8,m8); /* Y2 < Ymin ? */
comp (Y2,f8), Z2=dm(i0,m0), f8=pm(i8,m8); /* Y2 > Ymax ? */
comp (f8,Z2), f8=pm(i8,m8); /* Z2 < Zmin ? */
comp (Z2,f8); /* Z2 > Zmax ? */
OC2 = astat;
OC2 = fext OC2 by 26:6;
rej2:
if ne rts (db);
if ne dm(i1,m0) = REJECT;
r9 = OC1 or OC2;
acc2:
if eq dm(i1,m0) = ACCEPT;
clip2:
if ne call clipline;
rts;
.ENDSEG;
243
8 Graphics
The cubic Bezier polynomial uses four points to define a curve (or
surface): both endpoints and two other points not on the curve which
define tangents on the curve’s endpoints. These points (the p terms in the
following equation, are referred to as control points or knots (see Figure
8.1).
244
Graphics 8
8.2.1 Implementation
The code given is not looped. For a similar example taking advantage of
hardware-supported DO loops, see bspline.asm .
The register definitions used in the code below refer to the various
sections of the factored equation being assembled, as shown in Figure 8.2:
245
8 Graphics
File Name
bezier.asm
Version
Purpose
Define a curve (or surface).
Equations Implemented
3 2 2 3
x(t) = p (1-t) + 3 p t (1-t) + 3 p t (1-t) + p t
1x 2x 3x 4x
Calling Parameters
f0 = input value (t)
f1 = t - 1
f9 = P1
f5 = P2
f6 = P3
f7 = P4
f14 = 3.0
Return Values
f0 = x(t)
Registers Affected
f8, f12, f13
Cycle Count
11 cycles per bezier evaluation, straight-line code
10 cycles per bezier evaluation, looped
400ns per bezier evaluation @ 40ns instruction cycle
2.5 million bezier evaluations/sec @ 40ns instruction cycle
# PM Locations
11
# DM Locations
0
246
Graphics 8
*********************************************************************************/
.GLOBAL bezier;
#define P1 f9
#define P2 f5
#define P3 f6
#define P4 f7
#define TMP f4
#define A f12
#define B f8
#define C f8
#define D f8
#define E f12
#define F f12
#define G f8
#define H f13
#define I f12
F = E * P1;
H = T * P4, G = D - F;
H = H * T;
rts (db), H = H * T;
I = G * T_M1;
f0 = H + I;
.ENDSEG;
247
8 Graphics
8.3.1 Implementation
This routine evaluates the Cubic Bspline polynomial:
248
Graphics 8
The register definitions used in the code below refer to the various
sections of the factored equation, as shown in Figure 8.4:
M1 A6 A5 A4
A1 M2 M3
A2 A3
AX AY
The loop counter is set with R0, which holds the number of values for
which the Cubic B-Spline Equation will be executed.
249
8 Graphics
File Name
bspline.asm
Version
Purpose
Represent curves and surfaces as a compact list of control points.
Equations Implemented
3 2 3 2 3
-(u-1) p + [ 3u (u-2) + 4 ] p - [ 3(u + u + u) + 1) ] p + u p
1 2 3 4
x(u) = ———————————————————————————————————
6.0
Calling Parameters
r0 = number of x(u) equations to be calculated.
f14 = 1.0
i8 = PM pointer to P1 —> P4
i0 = DM pointer to 1-D array of input U values
i1 = DM pointer to constants { 3.0, 4.0, 0.1666666666 }
i2 = DM pointer to results x(U)
m1 = 0
m0 = 1
Return Values
i2 = pointer to results x(U)
Registers Affected
F0, F2, F4, F7, F8, F12
Cycle Count
15N + 11 cycles per bspline evaluation
600ns per bspline evaluation @ 40ns instruction cycle
1.7 million bspline evaluations/sec @ 40ns instruction cycle
# PM Locations
17 instruction + 4 data
# DM Locations
4 + 2 * N, where N = number of times x(U) will be calculated
250
Graphics 8
*********************************************************************************/
.GLOBAL bspline;
#define M1 f0
#define M2 f1
#define M3 f2
251
8 Graphics
#define AY f12 /* sum of last 2 terms */
U = dm(i0,m1);
AU = dm(i0,m0);
MUM1 = U - ONE;
rts;
.ENDSEG;
252
Graphics 8
Bit-Blt (short for Bit Boundary Block Transfer) is an algorithm for the
transfer of image data (pixels) from one location to another. The location
from which the data is transferred is called the source; the location to
which the data goes is called the destination. In addition to performing the
transfers, Bit-blt can perform logical or mathematical operations on the
data.
8.4.1 Implementation
BitBlt, or Bit Block Transfer, is the transferring of N words of data to a
destination where each destination word is formed by taking bits
OFFSET–1 to 0 from the one source datum, and bits 31 down to OFFSET of
the next source datum. For example, Figure 8.5 shows BitBlt between two
areas in memory, with OFFSET = b :
The BitBlt algorithm alternately uses two registers, X and Y ,to store the
source data to be transferred and two registers, P and Q , to hold the
results of shifting the bit patterns in X and Y. The offset is denoted as b.
Figure 8.6 shows a the usage of X and Y:
31 b b–1 0
X ––> xxxxxxxxx PPPPPPP
Y ––> PPPPPPPPP QQQQQQQ
253
8 Graphics
5. X OR Y -> DM_DEST.
6. Shift Y left by 32 – OFFSET. Read new value into X, Shift X right by OFFSET.
7. Y OR X -> DM_DEST.
254
Graphics 8
File Name
bitblt.asm
Version
Purpose
Bit Block Transfer.
Equations Implemented
None
Calling Parameters
r0 N
r1 OFFSET
i0 pointer to start of source
i1 pointer to start of destination
m0 +1
Return Values
none
Registers Affected
r0, r2-r6
Cycle Count
2N + 7 cycles
12.5 million words / second @ 40ns instruction cycle
50 million bytes / second @ 40ns instruction cycle
# PM Locations
15
# DM Locations
2 * N, where N is number of words to transfer
255
8 Graphics
*********************************************************************************/
256
Graphics 8
8.5.1 Implementation
Although the ADSP-21020 is a floating-point processor, the ability to use
both the ALU and Data Address Generators in parallel allow the required
four fixed-point operations to be executed in two cycles.
If the inner loop is implemented so that the ALU calculates the error term,
ERR , and DAG1 calculates the X and Y coordinates, the loop can be
implemented in two instructions:
do lp until lce;
if ge ERR = ERR - AX, modify(Y,SY);
ERR = ERR + AY, modify (X,SX);
lp: set_pixel(X,Y);
257
8 Graphics
This implies that the entire Bresenham inner loop, including setting the
chosen pixel, can be performed in as few as three instructions.
258
Graphics 8
File Name
bresen.asm:
Version
1.0
Purpose
Draws pixels approximating a line between two specified endpoints.
Equations Implemented
Y=SIN(X) or
Y=COS(X)
Calling Parameters
r0 = x1 = x coord on line start
r1 = y1 = y coord on line start
r2 = x2 = x coord on line end
r3 = y2 = y coord on line end
l2 = 0
l3 = 0
Return Values
F0 = Sine (or Cosine) of input Y=[-1,1]
Registers Affected
r0-r11, i2, m2, i3, m3
Cycle Count
2N + set_pixel*(N+1) + 25 cycles
where N = number of points
= max (abs(x2-x1),abs(y2-y1))
# PM Locations
34 words
# DM Locations
11 Words
259
8 Graphics
*********************************************************************************/
#define X1 r0
#define Y1 r1
#define X2 r2
#define Y2 r3
#define X i2
#define Y i3
#define XLEN l2
#define YLEN l3
#define XSTP 1
#define YSTP 1
#define SX m2
#define SY m3
#define DX r4
#define DY r5
#define AX r6
#define AY r7
#define RSX r8
#define RSY r9
#define ERR r11
/*——————————————————————————————————————
Bresenham Line Drawing Function
Label Meaning
———— ———————————————
bresen start of function
oct1 octant 1 line-drawing algorithm
lp1 octant 1 loop
oct2 octant 2 line-drawing algorithm
260
Graphics 8
lp2 octant 2 loop
——————————————————————————————————————*/
.SEGMENT /pm pm_code;
.GLOBAL bresen;
bresen:
NEG_ONE = -1;
XLEN = 0; YLEN = 0;
RSX = 0; RSY = 0;
DX = X2 - X1;
if ge RSX = RSX + XSTP; /* if ( x2>x1 ) step x-coord up */
if lt RSX = RSX - XSTP; /* if ( x2>x1 ) step x-coord down */
DY = Y2 - Y1;
if ge RSY = RSY + YSTP; /* if ( y2>y1 ) step y-coord up */
if lt RSY = RSY - YSTP; /* if ( y2>y1 ) step y-coord down */
AX = abs DX;
AX = ashift AX LEFT; /* X error term = 2 * ( abs(x2-x1) ) */
AY = abs DY, SX = RSX; /* save register-SX into modify register */
AY = ashift AY LEFT; /* Y error term = 2 * ( abs(y2-y1) ) */
comp(AX,AY), SY = RSY; /* save register-SY into modify register */
if lt jump oct2(db); /* if ( abs(x2-x1) > abs(y2-y1) ) { */
X = X1; /* computeOctant1; */
Y = Y1; /* } else { computeOctant2; } */
oct1:
set_pixel(X,Y);
oct2:
set_pixel(X,Y);
261
8 Graphics
P’ = M * P
where
8.6.1 Implementation
To implement 3-D Translation, Rotation and Scaling, as a 4x4 by 4x1
matrix multiply requires 16 multiplications and 12 additions. In this
routine, assume that M is a matrix of the form:
262
Graphics 8
Since the bottom row contains mostly zeroes, this matrix it can be broken
down into a 3×3 matrix of the r terms and a 3×1 matrix of the t terms; when
added together the two matrices yield the same values. The
transformation can be decomposed as
P’ = R * P + T
where
P= px
py
pz
T= tx
ty
tz
263
8 Graphics
The order of operations has been rearranged slightly to reduce cycle time
and also to take advantage of “empty” instruction slots that are available
during multifunction instructions. Instead of first performing all
multiplications of R * P and then adding T, multifunction capability is
utilized. For example, the F12 term in the last instruction before the loop
holds
The program is organized so that the code before the loop reads the
coordinates for the first point to be processed, the first two rows of matrix
R, and enough operands to supply the registers within the loop with data.
After the loop is processed, the remaining code calculates the last point
coordinates.
264
Graphics 8
File Name
transf.asm
Version
Purpose
Scaling, rotation, and repositioning with basic matrix operations.
Equations Implemented
P’ = M * P where
Calling Parameters
r0 = number of points to transform
i0 = index to point P (x,y,z,w)
m0 = +1
m1 = +2
i1 = index to transformed point area
i2 = index to matrix T
l2 = 3 (length of T)
i8 = index to matrix R
l8 = 9 (length of R)
m8 = +1
Return Values
Registers Affected
r0-r6, r8-r10, r12-r15
Cycle Count
10N+5 cycles, where N 3-D points are transformed
400ns per point @ 40ns instruction cycle
2.5 million points/sec @ 40ns instruction cycle
# PM Locations
27 instructions + 9 words of PM data
# DM Locations
10 words
265
8 Graphics
*********************************************************************************/
.GLOBAL transf;
f14=f8+f14, f1=dm(i0,m0),
f4=pm(i8,m8);
/*cont with circ buf reads*/
f15=f1*f4, f14=f11+f14, f2=dm(i0,m0), f4=pm(i8,m8);
f8=f2*f4, f3=dm(i0,m1), f4=pm(i8,m8);
f8=f3*f4, f12=f8+f15, f9=dm(i2,m0), f4=pm(i8,m8);
f15=f1*f4,f12=f8+f12,dm(i1,m0)=f13,f4=pm(i8,m8);
/*f13=pxr21+pyr22+pzr23+ty*/
trlp: f8=f2*f4, f12=f9+f12, dm(i1,m0)=f14, f4=pm(i8,m8);
/*f14=pxr31+pyr32+pzr33+tz*/
f14=f8+f14;
rts(db), f14=f11+f14;
dm(i1,m0)=f13; /*f13 = pxr21 +pyr22+pzr23
+ty*/
dm(i1,m0)=f14; /*f14 = pxr31+ pyr32+
pzr33+tz*/
266
Graphics 8
8.7 MULTIPLY 4×4 BY 4×1 MATRICES (3D GRAPHICS TRANSFORMATION)
.ENDSEG;
8.7.1 Implementation
This code example can be used in two ways. If the list of points is a
sequence of [x ,y, z, w] coordinates, assemble this code normally (asm21k
mul44x41). If the list of points is [x, y, z] coordinates only, and the w
coordinate is buried in another list of points, define the preprocessor
variable WOUT with the syntax asm21k -DWOUT mul44x41 . This may
be needed if transf.asm has been used in previous transformations,
since the transf.asm code doesn’t keep the w coordinate embedded in
the [x, y, z] coordinate list.
267
8 Graphics
of a cycle and written to at the end of the cycle. When register values are used as
input operands in algebraic expressions, the value that is operated upon is the value
contained in the register before the current cycle is executed.
File Name
mul44x41.asm
Version
Purpose
General 4x4 by 4x1 3D graphics transform on N 4x1 points.
Equations Implemented
Calling Parameters
r9 = number of points to transform
i0 = index to [x,y,z] coordinates
m0 = +1
m1 = +4
i1 = index to transformed point area
i2 = index to [w] coordinates
i8 = index to 4x4 matrix
l8 = 16
m8 = +1
Return Values
Registers Affected
r1, r4-r9, r14-r15
Cycle Count
16N+4 cycles, where N 3-D points are transformed
640ns per point @ 40ns instruction cycle
1.56 million points/sec @ 40ns instruction cycle
# PM Locations
268
Graphics 8
36 words of instructions + 16 words of PM data
# DM Locations
(8 * N) + 4, where N is the number of points to transform
*********************************************************************************/
.GLOBAL mul44x41;
#ifdef WOUT
#else
#endif
269
8 Graphics
f8=f1*f7, f14=f8+f14, f4=dm(IP,m0), f1=pm(i8,m8);
f15=f1*f4, f14=f8+f14, f5=dm(IP,m0), f1=pm(i8,m8);
f8=f1*f5, f6=dm(IP,m0), f1=pm(i8,m8);
f8=f1*f6, f15=f8+f15, f7=dm(IS,MS), f1=pm(i8,m8);
trlp: f8=f1*f7, f15=f8+f15, dm(i1,m0)=f14, f1=pm(i8,m8);
.ENDSEG;
8.8.1 Implementation
This code performs a table lookup with interpolation to approximate f(x)
from a table of values. An image (or a value on a graph ) can be calculated
with the following data:
270
Graphics 8
• values of the two points that surround the needed value (two y-
coordinates)
x – X0
• initial x value (starting index)
• spacing (step size) between theSindices
where
271
8 Graphics
needed value lies
S = x spacing in the table (step size)
X0 = value of the first index of the table
X = floor of (( x - X0 ) / S )
File Name
tblllkup.asm:
Version
Purpose
performs a table lookup with interpolation to approximate f(x) from a table
of values.
Equations Implemented
x - X0
f’(x) = { ( F[X+1] - F[X] ) * (--- - X ) } + F[X]
S
where:
Calling Parameters
f0 = x
f7 = 1/S
f15 = X0
i0 = start of table
b0 = i0 (base address for circular buffer)
l0 = length of table
m0 = 1
Return Values
f1 = interpolated f(x)
Registers Affected
MODE1 Register Bits:
272
Graphics 8
TRUNC = 1
Cycle Count
11 cycles
440ns @ 40ns instruction cycle
Note: 11 cycles includes the DAG hold after load of m1 register.
# PM Locations
10 words of instructions
# DM Locations
N words data memory, where N is the number of values in the table
*********************************************************************************/
.GLOBAL tbllkup;
273
8 Graphics
RES = RES * INTRP; /*multiply by interpolation factor*/
RES = RES + FI; /*add to known value */
.ENDSEG;
cx = ay * bz – az * by
cy = az * bx – ax * bz
cz = ax * by – ay * bx
8.9.1 Implementation
To maximize looped performance, this subroutine simultaneously
computes two vector cross products within a twelve-instruction loop.
Since each cross product requires six multiplications, the effective rate of
six cycles per cross product is the best rate that can be achieved on a
274
Graphics 8
processor with a single parallel multiplier.
File Name
xprod.asm
Version
Purpose
Vector Cross Product: produces a vector perpendicular to the two vector
operands.
Equations Implemented
cx = ay*bz - az*by
cy = az*bx - ax*bz
cz = ax*by - ay*bx
Calling Parameters
F0 = Input Value X=[6E-20, 6E20]
l_reg=0
Return Values
Registers Affected
r11 = number of cross products to perform (must be even and greater than
4)
i0 = index to list of A vectors
l0 = 0
m0 = +1
i1 = index to C output vectors
l1 = 0
i8 = index to list of B vectors
l8 = 0
m8 = +1
Cycle Count
6N+7 cycles, where N Cross Products are performed
240ns per Cross Product @ 40ns instruction cycle
275
8 Graphics
4.2 million Cross Products/sec @ 40ns instruction cycle
# PM Locations
29 words of PM instructions = 3 * N words of PM data,
where N is number of cross products to perform
# DM Locations
6 * N words of DM, where N is number of cross products to perform
*********************************************************************************/
.GLOBAL xprod;
#define AX1 f1
#define AY1 f2
#define AZ1 f3
#define AX2 f0
#define AY2 f3
#define AZ2 f2
#define BX1 f5
#define BY1 f6
#define BZ1 f7
#define BX2 f4
#define BY2 f7
#define BZ2 f6
#define AXBY f8
#define AYBX f12
#define AYBZ f9
#define AZBY f13
#define AZBX f10
#define AXBZ f14
#define CX f15
#define CY f15
#define CZ f15
r11 = lshift r11 by -1; /* 2 vector cross prods per loop, so divide */
/* original number of points by 2 */
r11 = r11-1, AX1=dm(i0,m0), BX1=pm(i8,m8);
/* first multply occurs */
/*outside before loop,so subtract 1 */
/* from num of points */
AY1=dm(i0,m0), BY1=pm(i8,m8); /* read AY1, BY1 */
AZ1=dm(i0,m0), BZ1=pm(i8,m8); /* read AZ1, BZ1 */
276
Graphics 8
AYBZ=AY1*BZ1;
AZBY=AZ1*BY1, AX2=dm(i0,m0), BX2=pm(i8,m8); /*read data of next pt
*/
8.10 REFERENCES
[FOLEY90] Foley, J., A. VanDam, et. al. 1990.Computer Graphics,
Principles and Practice. Addison-Wesley.
277
Image Processing 9
where
279
9 Image Processing
A matrix is a natural data structure for storing two-dimensional data. For
the two dimensional convolution, the data input values (x-values) are
stored in the data matrix (in data memory), and the impulse response
values, h-values, are stored in the transfer function matrix (in program
memory).
9.1.1 Implementation
FIR filters are used in the two-dimensional signal space just as they are in
the one-dimension signal space. For two-dimensional signals kernal
convolution performs FIR filtering on an image matrix. The kernal
convolution implements the following equation:
+∞ +∞
y(n1, n2) = ∑ ∑ x(k1, k2) h(n1 – k, n2 – k2)
k1 = –∞ k2= –∞
where
280
Image Processing 9
Figure 9.1 shows A graphical depiction of the 3 × 3 kernel convolution.
Data Array:
0,0 0,1 0,2 0,3 0,4 . . . 0,M-1
~ . . . . . ~
~ . . . . . ~
. . . . .
N-2,M-1
The data array elements in the upper left are referred to as {d00,d01...}.
The convolution kernel elements are referred to as { c00, c01, ... , c32, c33 }.
The first two convolutions are formed from these products:
c00*d00 = a0
c01*d01 = a1
c02*d02 = a2
c10*d10 = a3
c11*d11 = a4 SUM convolution sum #1
c12*d12 = a5
c20*d20 = a6
c21*d21 = a7
c22*d22 = a8
c00*d01 = b0
c01*d02 = b1
c02*d03 = b2
c10*d11 = b3
c11*d12 = b4 SUM convolution sum #2
c12*d13 = b5
c20*d21 = b6
c21*d22 = b7
c22*d23 = b8
281
9 Image Processing
Figure 9.2 shows that for every convolution sum there are
The program executes the outer loop m-2 times because the size the data
matrix is 3 × 3 the data starts at the upper left hand corner of the matrix. At
the m-2th iteration through the outer loop, the 3 × 3 kernel matrix covers
the data values through m; further loop iterations beyond m-2 times are
redundant. Therefore, the outer loop of the program determines the
starting point of the convolution and manages the update of the array
indices from row to row. The advantages of the automatic post modify
DAGs are obvious; one modify statement and one read statement can
manage the indices to process the data array.
The inner loop of the algorithm performs all of the calculations. While the
outer loop determines the starting point of the convolution and manages
the update of the array indices from row to row, the inner loop of the
algorithm performs the calculations. The third kernel coefficient is stored
in F5 in the setup segment of the code so as to avoid an extra cycle for
program memory read within the loop. (The ADSP-210xx multifunction
operations allow a data memory read/write and a program memory
read/write in one cycle, and the algorithm requires 18 reads and one
write.)
The calculations start at the upper left hand corner of the image matrix.
The first iteration of the kernel matrix by the image matrix performs the
operations over the first , second, and third rows. The second iteration
covers the second, third, and fourth rows, etc. Therefore, at the M – 2
iteration, the M – 1 and M rows have already been dealt with, and, in the
interest of time and space, we set the outer loop equal to M – 2.
The write operation to the PM location that contains the convolution sum
is at the second instruction in the inner loop, which may seem like an
unusual place for it. To minimize the cycle count, this write of the partial
sum cannot occur at the bottom of the loop. F12 is used to hold the partial
products, and F8 is used to hold the running total of the partial sums.
282
Image Processing 9
The first time through the loop, F8 (which only contains one product) is
written to program memory at the relative bottom of the circular buffer
output (where the base register is currently pointing). When the circular
buffer is written to again, with the valid sum of products, the memory is
correctly written at the relative top of the circular buffer.
File Name
comv3x3.asm
Version
Purpose
This code performs the Kernel Convolution.
Equations Implemented
y(n1, n2) = SUM (SUM ( x(k1, k2) h(n1 – k, n2 – k2))
Calling Parameters
none
Return Values
Registers Affected
Cycle Count
Setup: This code initializes register constants and address pointers needed.
cycles=10+1
time=11•50ns=550ns
# PM Locations
27 words of instructions + 9 words of PM data
# DM Locations
m * n + (m -2) * (n -2) where m = rows and n = collumns
*********************************************************************************/
283
9 Image Processing
/*The third coefficient is stored in the register file to free up a cycle to
output a result. The first write to the output buffer is unused, the pointer
then wraps around to the proper location at the start of output.*/
.SEGMENT/DMDATA dmsegment;
.VAR input[m*n];
.VAR output[(m-2)*(n-2)];
.ENDSEG;
.SEGMENT/PMDATA pmdataseg;
.VAR kernal[3*3]=”coef.dat;
.ENDSEG;
.SEGMENT/PMCODE pmcodeseg;
setup: M0=1;
M1=n-2;
M2=-(2*n+1);
M3=2;
M8=1;
M9=2;
B0=input; L0=0;
B8=kernal; L8=@kernal;
B9=output+(m-2)*(n-2); L9=@output;
F5=PM(kernal+2); /*store in register file to save cycle*/
284
Image Processing 9
In any six element subset of nine data values, at least one of the values will
be larger than the median (i.e., at least one value will have a rank of sixth
largest). This algorithm of median filtering considers the first six values in
the order that they appear in memory—no presorting is required. The
highest and lowest values are discarded and a new data value is read.
Performing this compare iteratively on successively smaller groups yields
a median of three values ranked in the middle of the array—the median is
the “middle” value of these three.
9.2.1 Implementation
The median filter algorithm has a simple implementation in ADSP-21000
family assembly language. Most of the program consists of a few
instructions repeated over and over again— the use of macros leads to
code that is easy to read and visualize.
285
9 Image Processing
The ADSP-210xx has a sufficient number of registers to perform the
memory read and write operations and the comparisons and swaps that
follow them. If registers were not available, it would not be possible to
read and compare using the same register; an intermediate storage area in
memory and overhead cycles used for memory management would be
needed.
File Name
med3x3.asm
Version
1.0
Purpose
Median filtering: replace a given data value with the median value of its
neighboring elements.
Equations Implemented
Y=SIN(X) or
Y=COS(X)
Calling Parameters
i0 index to data values
m0 column offset (usually 1)
m1 row offset (usually [#pixels/row - 2])
m2 offset to next 3x3 block (usually -[2*m1 + 1])
Return Values
r0 median value
Registers Affected
r0-r6 tmp values
Cycle Count
56 cycles
2.24µs per 3x3
587ms for median filtering a 512x512 imag
# PM Locations
16 words of instruction
# DM Locations
9 words
286
Image Processing 9
*********************************************************************************/
287
9 Image Processing
After all pixel intensities have been tallied, the histogram array can be
analyzed to determine the darkness or brightness of the image. If the
image is too dark, most of the pixels will record in the first 128 bins
(Figure 9.3). If the image is too bright, most of the pixels will record in the
second 128 bins (Figure 9.4).
O
F
O
C
C
U
R
E
N
C
E
S
0 255
PIXEL INTENSITY
288
Image Processing 9
Figure 9.3 Histogram Of Dark Picture
N
U
M
B
E
R
O
F
O
C
C
U
R
E
N
C
E
S
0 255
PIXEL INTENSITY
9.3.1 Implementation
The following sequence of tasks generalize the calculation of the
histogram array:
2. Add the pixel value to the base address of the histogram array.
File Name
histo.asm
Version
Purpose
An image histogram is an array summarizing the number of occurrences of each
grey level in an image. If a pixel in a typical monochrome image can take on
any of 256 distinct values, the histogram would need 256 bins. In each bin
the number of ocurrences of this grey level is stored.
The algorithm used here assumes the monochrome image is stored in a buffer in
Data Memory, and the histogram is formed in Data Memory.
Equations Implemented
None
Calling Parameters
b0,i0 = data buffer start
l0 = data buffer length
m0 = 1
m15 = 0
Return Values
histogram output bins pointed to by i1
Registers Affected
r0 input data 1
r1 bin value 1
r2 input data 2
r3 bin value 2, tmp register
r14 temp bin buffer #1 start
r15 temp bin buffer #2 start
Cycle Count
3.5N + 2B + 30 cycles
where N = number of data values
B = number of bins
# PM Locations
32 words instructions + 512 words data
# DM Locations
N * 256, where N = number of data values
290
Image Processing 9
*********************************************************************************/
r0=dm(i0,m0);
r0=r0+r14, r2=dm(i0,m0);
lcntr = r3, do hloop until lce;
r2=r2+r15, i8=r0;
i9=r2;
r1=pm(i8,m15); /* 2 cycles due to I register load latency */
r1=r1+1, r3=pm(i9,m15);
r3=r3+1, r0=dm(i0,m0), pm(i8,m15)=r1;
hloop: r0=r0+r14, r2=dm(i0,m0), pm(i9,m15)=r3;
r2=r2+r15, i8=r0;
i9=r2;
r1=pm(i8,m15);
r1=r1+1, r3=pm(i9,m15);
r3=r3+1, pm(i8,m15)=r1;
pm(i9,m15)=r3;
b1=histogram; l1=@histogram;
i8=b8;
i9=b9;
r2=l1;
r2=r2-1, r0=pm(i8,m8);
291
9 Image Processing
r1=pm(i9,m8);
lcntr=r2, do combine until lce;
r2=r0+r1, r0=pm(i8,m8);
ombine: dm(i1,m0)=r2, r1=pm(i9,m8);
rts (db);
r2=r0+r1;
dm(i1,m0)=r2;
ENDSEG;
Median filters are used in image processing to average the image without
blurring edges, like low pass and mean average filters do. Median filters
are non-linear functions and are not used in speech or audio signal
processing.
9.4.1 Implementation
Median filters have a delay line similar to the FIR filter that works on the
last N samples. After the median filter processes a sample, it outputs the
results and waits for the next input sample. When the next sample is
received, it replaces the oldest sample in the delay line.
Figure 9.5 demonstrates the first pass through the median filter to resolve
the lowest magnitude sample. The median filter result is four in this
example.
292
Image Processing 9
median filter is performed, as demonstrated in figure 9.5. Each pass of the
outer loop resolves the next highest magnitude sample in the median filter
buffer. Each pass through the inner loop compares the current lowest
sample in the outer loop pass to the next sample in the median filter
buffer. If the next sample is less than the current lowest, they are swapped.
The last outer loop pass resolves the middle or median sample in the
median filter buffer. After the median sample has been resolved, it is not
necessary to resolve the remaining higher magnitude samples.
base_adr 4 X X X X
5 5 X X X
6 6 6 X X
7 7 7 7 X
1 4 5 6 7
2 2 4 5 6
3 3 3 4 5
before median after first after second after third after fourth
filter pass pass pass or last pass
293
9 Image Processing
File Name
med_fix.asm
Version
Purpose
N tap one-dimensional median filter subroutine for fixed point data
Equations Implemented
Calling Parameters
b1, i1 = start address of input delay line in data memory
l1 = length of delay line buffer
m1 =1 - to modify index registers
b8, i8 = start address of median filter buffer in program mem
l8 =length of delay line buffer
m9 =1 - to modify index registers
Return Values
r4 = median of values in delay line
Registers Affected
r0, r2, r4, r5, r15
Cycle Count
FILTER_LEN + 6*[(FILTER_LEN+1)/2] + 14 + 3*sum[N]
where N=(FILTER_LEN-1)/2 to FILTER_LEN-1
99 cycles for FILTER_LEN=7
# PM Locations
16 Instruction + N Words of PM Data, where N is the order of the median
filter
# DM Locations
N Words, where N is the order of the median filter
294
Image Processing 9
*********************************************************************************/
.segment/pm pmcode;
.GLOBAL start_median;
start_median:
r0=dm(i1,m1);
lcntr=FILTER_LEN-1, do xfer until lce;
xfer: r0=dm(i1,m1), pm(i8,m9)=r0; /* transfer to filter buffer */
/* read from input buffer */
pm(i8,m9)=r0; /* transfer to filter buffer */
/*
median filter loop - find median value in delay line using this technique:
median[M]=median[N], median[M]=median[N]
inc M
inc N
*/
/* each pass through the outer loop resolves the next greatest magnitude */
/* in the median filter buffer - median[N] */
/* inner loop finds minimum of the remaining values in median filter buffer */
.endseg;
/
*********************************************************************************
File Name
med_flt.asm
Version
1.0
Purpose
N tap one-dimensional median filter subroutine for floating point data
Calling Parameters
b1, i1 = start address of input delay line in data memory
l1 = length of delay line buffer
m1 =1 - to modify index registers
b8, i8 = start address of median filter buffer in program mem
l8 =length of delay line buffer
m9 =1 - to modify index registers
Return Values
f4 = median of values in delay line
Registers Affected
f0, f2, f4, f5, r15
Cycle Count
FILTER_LEN + 6*[(FILTER_LEN+1)/2] + 14 + 3*sum[N]
where N=(FILTER_LEN-1)/2 to FILTER_LEN-1
99 cycles for FILTER_LEN=7
# PM Locations
16 instructions + N Words of PM Data,
where N is the order of the median filter
# DM Locations
296
Image Processing 9
N Words, where N is the order of the median filter
*********************************************************************************/
Execution Time:
*/
.segment/pm pmcode;
.GLOBAL start_median;
start_median:
f0=dm(i1,m1);
lcntr=FILTER_LEN-1, do xfer until lce;
xfer: f0=dm(i1,m1), pm(i8,m9)=f0; /* transfer to filter buffer */
/* read from input buffer */
pm(i8,m9)=f0; /* transfer to filter buffer */
/*
median filter loop - find median value in delay line using this technique:
median[M]=median[N], median[M]=median[N]
inc M
inc N
*/
/* each pass through the outer loop resolves the next greatest magnitude */
/* in the median filter buffer - median[N] */
297
9 Image Processing
/* where M=N+1 first */
r15=r15-1, f5=pm(i9,m9); /* decrement inner loop count */
/* read median[M] */
/* inner loop finds minimum of the remaining values in median filter buffer */
.endseg;
9.5 REFERENCES
[GLASSNER90] Glassner, Andrew S., ed. 1990. Graphics Gems. San Diego, CA:
Academic Press, Inc.
298
JTAG Downloader 10
For more information about the JTAG Downloader, contact applications support by email at
processor.support@analog.com.
299
Index
A
Adaptive filters
benchmarks 202
implementations 167
testing shell for adaptive filters 199
uses of 158, 159, 160
Arctangent
implementation 27
subroutine 29
B
Bit block transfer
transfer of image data 253
Bit-reversal 210, 211
Bresenham line drawing
pixels 257
set_pixel 257
C
Cascaded_biquad 104
Convolution equation 90
Cosine approximation
sine/cosine approximation subroutine 18
Cosine functions 15
Cubic B-spline polynomial evaluation
control points 248
curves 248
knots 248
surfaces 248
Cubic bexier polynomial evaluation
control points 244
knots 244
parametric equations 244
three-dimensional curved surfaces 244
337
Index
D
Decimation filters 113
Decimation-in frequency (DIF) FFT 207
Decimation-in-time (DIT) FFT 207
Digit-reversal 213
Digital filtering 89
Discrete fourier transform (DFT) 205
computation of the DFT 206
frequency bins 206
twiddle factors 206
Divide operation
division subroutine 44
implementation 44
iterative convergence algorithm 44
Division 44
E
EPROM
how to make 316
Exponential approximation
exponential function 52
exponential subroutine 55
implementation 53
F
Fast fourier transform (FFT) 205
architectural features for FFTs 210
benchmarks 216
bit-reversal 210, 211
butterfly calculations 208
complex input radix-2 FFT 211
complex input radix-4 FFT 211
decimation-in frequency (DIF) FFT 207
decimation-in-time (DIT) FFT 207
derivation of the fast fourier transform 207
digit-reversal 213
groups 210
inverse complex FFTs 215
radix-2 FFT 205, 207, 212
radix-4 FFT 205, 207, 213
stages 210
twiddle factor 206, 212, 214
Filter structures
lattice filter 163
symmetric transversal structure 162
transversal FIR filter 161
338
Index
G
Gauss-Jordan algorithm 81
Graphic line accept/reject-3D
clipline 239
display device 237
trivial acceptance 238
trivial rejection 238
view volume 237
Graphics algorithms 71
Graphics translation, rotation, & scaling-3D
4×4 transformation matrix 262
multifunction operations 264
x, y, and z coordinates 263
x, y, z, and w coordinates 263
H
Histogram equalization
gain and offset 288
pixel intensity 288
pixels 288
register pipelining 289
I
Image processing 71
Infinite Impulse Response (IIR) filter 119
advantages 100
canonical form 100
cascaded_biquad 104
code listings 106
direct form II 100
implementation 101
Interpolation filters 113
Inverse complex FFTs 215
Iterative convergence algorithm 44
339
Index
J
JTAG
definition 299
instruction register (IR) 306
pins 300
signals 300
test access port operations 305
JTAG downloader
block diagram 302
code listings 320
downloader operations 307
hardware 300
parts list 304
prototype board layout 304
prototype schematic 303
reference 336
software 310
software example 313
state diagram 305
system diagram 301
timing considerations 308
timing diagram 309
TMS & TDI bit generation 311
L
Least Mean Square (LMS) filters
algorithm 164
benchmarks 202
code listing 169
gradient lattice-ladder 189
implementation 168
lattice filter LMS with joint process estimation 189
leaky LMS algorithm 171
normalized LMS algorithm 173
sign-data LMS 179
sign-error LMS 176
sign-sign LMS 183
symmetric transversal filter implementation LMS 185
Logarithm approximations 46
implementation 47
logarithm approximation subroutine 49
340
Index
M
Matrices 71
Matrix functions
inverse of a matrix 71
matrix inversion 81, 84
multiplication 73, 77
multiplying 71
M×N by N×0 multiplication 79
M×N by N×1 multiplication 75
M×N matrix by a N×0 matrix 77
M×N matrix by an N×1 vector 73
rolling the loop 74
row major order 72
storing a matrix 72
Matrix inversion 81
Median filtering (3×3)
comp (compare) 285
delay line 292
image blurring 292
impulse noise 285
macros 285
middle sorting 292
shot noise 285
Multiply 4×4 by 4×1 matrices (3D graphics transformation)
graphics transformation 267
w coordinate 267
Multirate filters
decimation filter listing 117
decimation filters 113
interpolation filter listing 124
interpolation filters 113
multirate systems 113
rational rate changer (external interrupt-based) 138
rational rate changer (timer-based) 129
rational rate changer listing 133
single-stage decimation filter 114
single-stage interpolation filter 122
two-stage decimation filter 143
two-stage interpolation filter 150
341
Index
N
Newton-Raphson iteration 34
P
Polynomial approximation 61
Power function
hidden scaling factor 59
implementation 59
power subroutine 62
pseudo extended-precision product 61
R
Recursive Least Square (RLS) filters
algorithm 165
benchmarks 203
forgetting factor 166
implementation 194
Kalman Gain Vector 166
Z-matrix 194
S
Sine approximation
sine/cosine approximation subroutine 18
Sine/cosine approximation 15
Square root
code listings 35
implementation 34
Newton-Raphson iteration 34
Square root & inverse square root approximations 33
T
Table lookup with interpolation 270
Tangent
implementation 22
min-max polynomial approximation 22
tangent subroutine 24
Tangent approximation 22
Twiddle factor 206, 212, 214
Two-dimensional convolution
convolution sum 279
FIR filters 280
impulse response function 279
linear time-invariant system 279
transfer function matrix 280
two-dimensional convolution sum 279
342
Index
V
Vector cross product
3D graphics 274
back-face culling 274
illumination 274
343
ADSP-21000 Family Development Software
Release 3.2 Installation Guide and
Release Note
7/14/95 page 2
ADSP-21000 Family Development Software Release 3.2
1 GENERAL INFORMATION........................................................................................................................5
1.1 D OCUMENTATION ........................................................................................................................................5
1.2 W HAT ’ S C OVERED IN THIS R ELEASE NOTE ?.....................................................................................................5
1.3 W HO ARE YOU?...........................................................................................................................................5
1.4 FOR T ECHNICAL ASSISTANCE .........................................................................................................................5
1.5 FOR B UG F IXES AND DOCUMENTATION ERRATA................................................................................................5
1.6 W ARRANTY ................................................................................................................................................6
2 SALES PACKAGE......................................................................................................................................7
3 SOFTWARE INSTALLATION.....................................................................................................................7
3.1 PC VERSION ...............................................................................................................................................7
3.1.1 PC Requirements...................................................................................................................................7
3.1.2 PC Installation Procedure......................................................................................................................8
3.1.3 PC Environment Variables......................................................................................................................8
3.2 SUN VERSION..............................................................................................................................................9
3.2.1 Sun Requirements..................................................................................................................................9
3.2.2 Sun Installation Procedure......................................................................................................................9
3.2.3 Sun Environment Variables...................................................................................................................10
4 CHANGES FROM PREVIOUS RELEASE..................................................................................................10
4.1 A SSEMBLER...............................................................................................................................................10
4.2 A SSEMBLY R UN -T IME LIBRARY ....................................................................................................................10
4.3 L INKER ....................................................................................................................................................10
4.4 L OADER ...................................................................................................................................................10
4.5 L IBRARIAN ...............................................................................................................................................11
4.6 SIMULATORS .............................................................................................................................................11
4.6.1 Version Specific Changes......................................................................................................................11
4.6.2 Common Changes...............................................................................................................................11
4.7 E MULATOR ...............................................................................................................................................12
4.8 SPLITTER..................................................................................................................................................12
4.9 C LANGUAGE T OOLS .................................................................................................................................12
4.9.1 G21K Preprocessor.............................................................................................................................12
4.9.2 G21K Optimizing C Compiler...............................................................................................................13
4.9.3 G21K Run Time Library.......................................................................................................................13
4.10 COFF T OOLS..........................................................................................................................................13
4.10.1 CSWAP............................................................................................................................................13
4.10.2 CDUMP...........................................................................................................................................13
5 RESTRICTIONS.......................................................................................................................................13
5.1 A SSEMBLER...............................................................................................................................................14
5.2 A SSEMBLY R UN -T IME LIBRARY ....................................................................................................................14
5.3 L INKER ....................................................................................................................................................14
5.4 L OADER ...................................................................................................................................................14
5.5 L IBRARIAN ...............................................................................................................................................15
5.6 SIMULATORS .............................................................................................................................................15
5.6.1 Version Specific Restrictions.................................................................................................................15
5.6.2 Common Restrictions...........................................................................................................................15
5.7 E MULATOR ...............................................................................................................................................17
5.7.1 Common Restrictions...........................................................................................................................17
5.7.2 ADSP-21010/ADSP-21020 EZ-ICE Specific Restrictions.............................................................................19
5.7.3 SHARC EZ-ICE Specific Restrictions......................................................................................................19
5.8 SPLITTER..................................................................................................................................................20
5.9 C LANGUAGE T OOLS .................................................................................................................................20
5.9.1 G21K Preprocessor.............................................................................................................................20
5.9.2 G21K Optimizing C Compiler...............................................................................................................20
5.9.3 G21K Numerical ‘C’ Extensions............................................................................................................20
7/14/95 page 3
ADSP-21000 Family Development Software Release 3.2
7/14/95 page 4
ADSP-21000 Family Development Software Release 3.2
1 GENERAL INFORMATION
1.1 Documentation
To order additional copies of any of the following publications, contact an Analog Devices sales
agent. Documentation for use with this release includes the following:
Assembler
Linker
Librarian
Simulator
Emulator (EZ-ICE® In-Circuit Emulator)
PROM Splitter
Loader
CSWAP
CDUMP
G21K Optimizing C Compiler
• G21K C Runtime Library
7/14/95 page 5
ADSP-21000 Family Development Software Release 3.2
1.6 Warranty
Analog Devices warrants G21K for users who have purchased this program as part of an
ADSP-21000 Family Development Software package according to the terms and conditions of the
ADSP-2100 and ADSP-21000 Family Development Software License Agreement provided with
this software distribution, subject to the GNU General Public License, also provided therein.
7/14/95 page 6
ADSP-21000 Family Development Software Release 3.2
2 SALES PACKAGE
This software package is available for both PC and SUN workstations. The part numbers are:
3 SOFTWARE INSTALLATION
The installation procedures for the PC version and the Sun version of these tools differ. Note the
requirements for your package and follow the appropriate instructions.
3.1 PC Version
The PC package contains a set of 3-1/2" high-density diskettes. Make sure that your disk drive is
capable of reading the diskettes. If you require 5-1/4" diskettes, please specify this on your
registration form to receive 5-1/4" diskettes.
3.1.1 PC Requirements
This software requires up to 13 MB of hard disk storage. The required system configuration to
run the development software is a 386, or higher, based PC with hard disk, high-density floppy
disk drive, a color video card and VGA monitor, and a minimum of 2 MB extended RAM. DOS 3.1
or higher. Microsoft® Windows™ 3.1 or higher is required to run Windows-based tools. A
mouse is highly recommended.
You must have at least 450 Kb of memory available in the lower 640 Kb in order for the software
to run properly. Most device drivers and TSRs consume this lower memory. Maximize the low
memory available to the software by moving as many of these device drivers and TSRs into high
memory.
The software opens many files and accesses these files frequently. Performance can be
degraded, or fail in some cases, if DOS is not set up to support this. In the config.sys file, be
sure the following directives are present and the values for these directives are at least equal to
the minimum shown below:
• BUFFERS=25
• FILES=40
If you will be using the development software in native DOS, both Extended Memory Support
(XMS) and Expanded Memory Support (EMS) are required. Use memory managers that support
both standards and be sure not to disable EMS memory in your memory manager if you will be
7/14/95 page 7
ADSP-21000 Family Development Software Release 3.2
running native DOS. If you will not be using the development software in native DOS, but will
use the software in DOS boxes under Windows, only XMS memory support is required.
7/14/95 page 8
ADSP-21000 Family Development Software Release 3.2
In addition, the development software, specifically the simulators, use the gnuplot graphics
package from the Free Software Foundation for plotting memory. If you plan on using the plotting
feature of the simulators, be sure you have version 3.5 of gnuplot in your executable search
path. You can obtain gnuplot from the official distribution site; ftp.dartmouth.edu [129.170.16.4].
The file is called /pub/gnuplot/gnuplot3.5.tar.Z. Official mirror sites are, in Australia,
monu1.cc.monash.edu.au [130.194.1.101] and, in Europe, irisa.irisa.fr [131.254.254.2].
Lastly, if using the Xnews server, be sure Xnews patch #100444-66 is installed. This patch can
be obtained from Sun, via anonymous FTP, at sunsolve.sun.com. The file is called
/pub/patches/100444-66.tar.Z.
7/14/95 page 9
ADSP-21000 Family Development Software Release 3.2
source ${MWHOME}/setup-mwuser.csh
7. Source your .cshrc file or, log out and then log back in again, for the changes to take effect.
4.1 Assembler
• Minor bug fixes.
• The ADSP-21060’s memory-mapped IOP Registers are programmed by writing to the
appropriate address in internal memory. The symbolic names of the IOP registers can also be
used in ADSP-21060 programs—the #define definitions for these names are contained in
the file def21060.h which is provided in the INCLUDE directory of the ADSP-21000 Family
Development Software. The def21060.h file is also described in the ADSP-21060 User’s
Manual.
4.3 Linker
• The -r switch is no longer supported.
• Minor bug fixes.
4.4 Loader
• Minor bug fixes.
7/14/95 page 10
ADSP-21000 Family Development Software Release 3.2
4.5 Librarian
• The librarian now accepts absolute path names.
• Minor bug fixes.
4.6 Simulators
4.6.1.1 PC Version
• All simulators are now true 16 bit Windows applications. DOS versions are no longer
provided.
7/14/95 page 11
ADSP-21000 Family Development Software Release 3.2
• The simulators can generate and send debug information to the standard AUX device. The
initialization file contains a verbose mode variable, which enables or disables this information.
In order to display this information, you may either run the dbwin application, supplied with
this development software, or connect a monochrome monitor. The system.ini file, located in
your windows directory, has been modified to insure this information is turned off by default.
Specifically, the [debug] section has been modified to include the line “OutputTo=NUL”.
4.7 Emulator
• All emulators are now true 16 bit Windows applications. DOS versions are no longer
provided.
• Full SHARC IOP bitfield decoding and editing.
• Enhanced memory window tracking by PC, DAG index registers and DMA index registers.
• Format and tracking info shown in memory window title bars.
• Font selection now available via menu pulldown and initialization file variable “font_size”.
• Capability to load only symbols from an executable file via the menu pulldown.
• Significantly improved speed of memory transfers between host and target.
• Control of background DMA activity, when target is halted, via initialization file variable
“dma_stop”.
• The display can now be refreshed by the Ctrl-Alt-Shift G key.
• The initialization files used by the emulators, wice060.ini and wice020.ini, are located in the
Windows directory. The settings in these files replace the command line parameters used in
traditional DOS programs.
• The emulators use the GNU plot utility, “wgnuplot”, to perform plotting of data. Memory plot
commands result in the creation of plot files that are passed to the wgnuplot program, that is
spawned to display the plot windows. See the files wgnuplot.txt and wgnuplot.hlp for more
information.
• The emulators can generate and send debug information to the standard AUX device. The
initialization file contains a verbose mode variable, which enables or disables this information.
In order to display this information, you may either run the dbwin application, supplied with
this development software, or connect a monochrome monitor. The system.ini file, located in
your windows directory, has been modified to insure this information is turned off by default.
Specifically, the [debug] section has been modified to include the line “OutputTo=NUL”.
4.8 Splitter
Minor bug fixes.
7/14/95 page 12
ADSP-21000 Family Development Software Release 3.2
4.10.1 CSWAP
• No changes.
4.10.2 CDUMP
• No changes.
5 RESTRICTIONS
This section details restrictions in this version of the ADSP-21000 Family Development Software.
The restrictions described in this section are supplemental to those described in the
corresponding user’s manual.
7/14/95 page 13
ADSP-21000 Family Development Software Release 3.2
5.1 Assembler
• When using the backslash (\) line continuation character, do not put any characters after the
backslash on the same line; it must be followed immediately by a carriage return. Otherwise,
an assembler error occurs. For example, a comment or a space after a backslash causes an
assembler error.
• When using the Rn=FDEP Rx BY <bit6>:<len6> instruction, use only positive
numbers for <bit6>. A negative number for <bit6> results in a syntax error that is not
properly reported.
• When multiple instructions are placed on a single line in the source file, the listing file shows
only the last instruction on the line.
• Unprintable special characters in the source file may cause the assembler to crash.
• Macro substitutions made with a #define statement cannot begin with a number. The
following example will fail:
#define 2me r0 = 1
• In the architecture file, the END value, must be greater than the BEGIN value when defining a
segment. If not, when computing the size of a segment, the assembler incorrectly interprets
the resulting negative number as a large positive number.
• In the include directory, there is a file called asm_sprt.h, in which are defined various macros
to aid the user in writing assembly level implementations of ‘C’ functions. The use of this file
is described in the ‘C’ Tools Manual. Omitted from the manual is a statement indicating this file
is to be used only as an include file in ‘C’ programs. Unfortunately, when the file is included
from an assembly level routine, the pre-processor defines that are automatically generated
by the compiler, are not asserted. As such, what get’s included from the file are ADSP-
2106x definitions, which is fine if that is your target processor. If your target processor is an
ADSP-21020, you must manually define the processor define, __ADSP21020__, in order to
include the proper ADSP-21020 definitions.
5.3 Linker
• The -s switch, used to strip symbolic information from an executable file does not work
reliably.
• The linker will generate an error if the architecture file defines a multiprocessor memory
space segment for processor 1, ID = 001, that defines locations 0x80000 thru 0x9ffff. This
is because that memory space is in fact locations 0x00000 thru 0x7ffff of that processor.
• In the architecture file, the END value, must be greater than the BEGIN value when defining a
segment. If not, when computing the size of a segment, the linker incorrectly interprets the
resulting negative number as a large positive number.
5.4 Loader
• For link booting to function properly, the file “060_link.exe” must be copied from the
21k/examples/06x/linkboot/linkr0 or linkr1 directory into the 21k/etc directory.
• The boot loader does not make special provision for an executable that can be loaded by the
automatic hardware boot alone. It creates a boot file that contains the boot loader, just to load
7/14/95 page 14
ADSP-21000 Family Development Software Release 3.2
over itself. Although this works without a problem, it generates a boot loader file that is
significantly larger (more than twice the size) than necessary.
• The Loader does not efficiently pack the final initialization. Regardless of the size of the last
initialization, the loader places 256 48-bit words in the boot file.
• When using the loader, do not use the mem21k utility. Any executable that is going to be
processed by ldr21k should not be processed with mem21k. See the “-nomem’ compiler
switch.
• The loader requires the use of address 0x20004 for booting an ADSP-2106x target. Any
executable file that is going to be processed by ldr21k, should have a NOP or an IDLE
instruction at address 0x20004. This only applies to ADSP-2106x targets.
• As noted in the users manual, the DMAC6 control register will not contain a zero when the
program begins execution.
• If a single ADSP-2106x has a processor ID that does not equal zero, a dummy executable file
must be placed on the command line or else the boot will not occur properly. When the boot
loader is executed on an ADSP-2106x with a non-zero ID, it expects to read a table of
contents before the boot data begins. This table of contents points to the proper place in the
ROM that holds the code for each different processor. The ldr21k tool only places a table of
contents at the beginning of a ROM file if more than one executable is given on the command
line. A dummy executable can be created by assembling and linking a file that contains a
single NOP.
• The -fbinary switch has not been fully tested and, although implemented, is not supported for
this release.
5.5 Librarian
• None.
5.6 Simulators
5.6.1.1 PC Version
• The source file sizes in CBUG are restricted to less than 64Kb.
• The Program Manager icons for the simulators were inadvertantly omitted.
7/14/95 page 15
ADSP-21000 Family Development Software Release 3.2
• Save Layout
• Execute Instruction
• Counting Breakpoints
• Auto-Interrupt
• Auto-Flag
• Backtrace
• The ADSP-2106x silicon prioritizes DMA channels as follows: 0, 1, 2, 3, ChainLoad, ExtRW,
4, 5, 6, 7, 8, 9. The simulator has the priority of ChainLoad and ExtRW reversed.
• When entering instructions into a PM window, “if ... else” instructions must be entered with a
comma (,) before the else.
• The simulator allows writing to the following read-only and reserved bits in the IOP memory
space:
LSRQ: 31-20 and 19-16
STCTLx: 28-24
LCTL: 31-30
LCOM: 31-26, 25-23, 11-0
LAR: 31-30
SYSCON: 7
STKY: 26-21
• The simulator incorrectly initializes the multiprocessing vector interrupt register, VIRPT, to
0x40014 (external). The correct value should be 0x20014 (internal).
• The simulator’s display for the bus timeout counter, BCNT, incorrectly displays the bus
timeout maximum value, BMAX.
• The simulator does not automatically clear it’s symbol table when a new executable is loaded.
• The simulator does not honor the short word sign extend bit, SSE, of the MODE1 register.
• Displays for the serial port FIFO buffers will be available in a future release.
• Variable names which are identical to the segment name in which they reside cause the
simulator to omit the variable name from the symbol table. The practice of naming symbols
and segments with identical names should be avoided.
• The architecture file “.BANK” directive has no affect. The various “.BANK” qualifiers are not
asserted by the simulator. In addition, wait states MSIZE, IMDWx bits etc., are not set
automatically. These bits should be set by the user’s code at run time.
• The ADSP-2106x simulator does not support link ports. SPORT DMA transfers are
supported. External port master mode DMA transfers with external memory functionality is
included but not tested nor supported.
• Host and PROM booting options are the only boot loading operation that the ADSP-2106x
simulator supports. Host boot files must be in ASCII format as created by the loader, ldr21k,
with the -fascii switch.
• Arithmetic loops (i.e. non-counter-based loops) containing one or two instructions that
terminate on the first iteration of the loop are not properly simulated. Workaround: Insert NOP
instructions into the body of the loop to make it at least 3 instructions long.
7/14/95 page 16
ADSP-21000 Family Development Software Release 3.2
• When enabling or disabling the timer through the MODE2 register, the timer may exhibit an
extra cycle of latency before the change takes effect. This extra cycle of latency may also
occur when the Auto-Interrupt Control and Auto-Flag Control functions are used.
• If an assembly label coincides with other labels, only the last label linked will be seen by the
simulator symbols display and search.
• In the DAGs, when setting the base registers via the user interface, the corresponding index
register gets set also. This operation differs from the emulators where the index register
does not get set.
• The Buffer Hang Disable (BHD) bit in the ADSP-2106x SYSCON register defaults to a clear
state. This means the processor core will hang by default if memory-mapped FIFOs are
read-when-empty or written-when-full.
• The simulator does not support both serial ports running in loopback simultaneously. The
simulator does support loopback of either serial port, provided only 1 serial port is in loopback
at a time.
• DMA SPORT transfers do not halt properly in the simulator. When a DMA count register
decrements to zero, and no DMA chaining is called for, the DMA channel should disable itself.
However, the DMA activity bit, in the DMASTAT register, is not cleared and the SPORT buffer
status bits are incorrect.
• The Memory Dump command only supports hexadecimal syntax. If dumps in floating point
format are required, use the plot memory command, and examine the resulting .plt file.
• In the ADSP-2106x simulator, if the multiplier underflow bit, MUS, of the STKY register is set,
the simulator incorrectly clears the floating point underflow bit, FLTUI, of the IRPTL register
when the FLTUI interrupt is serviced.
• In the status stack windows, the simulator allows the user to modify the FLAG bits, bits 19
through 22, of the ASTAT registers. These simulator should not allow modification of these
bits.
• In the ADSP-21020 simulator, Program Memory windows do not support the floating point
display mode.
• In the ADSP-21020 simulator, the BSET instruction will not simulate properly if the bit that is to
be set lies in bit position 15 through 31. Note the following run-time library functions may not
simulate properly as they employ use of this function; A-Law/ U-Law encoding, sin, atof,
strtod, dtoi, dmult, dadd and dsub.
• In the ADSP-2106x simulator, memory references to reserved locations in IO space will result
in the “Error --- Instruction Timed Out” message.
• The simulator does not support 40 bit fixed point values with the LOAD command; fixed point
values are limited to 32 bits.
• After a chip reset, all simulators, include the 2 clock cycles that are used to fill the
fetch/decode/execute pipeline, in their cycle counters. The EZ-ICEs do not count these 2
cycles.
5.7 Emulator
This section describes the restrictions for each ADSP-210xx EZ-ICE emulator.
7/14/95 page 17
ADSP-21000 Family Development Software Release 3.2
• The following pulldown menus are not implemented for this release:
• Save Layout
• Execute Instruction
• Counting Breakpoints
• Auto-Interrupt
• Auto-Flag
• Backtrace
• The Program Manager icons for the emulators were inadvertantly omitted.
• The emulator does not automatically clear it’s symbol table when a new executable is loaded.
• The source file sizes in CBUG are restricted to less than 64Kb.
• The architecture file “.BANK” directive has no affect. The various “.BANK” qualifiers are not
asserted by the emulator. In addition, wait states MSIZE, IMDWx bits etc., are not set
automatically. These bits should be set by the user’s code at run time.
• Variable names which are identical to the segment name in which they reside cause the
emulator to omit the variable name from the symbol table. The practice of naming symbols and
segments with identical names should be avoided.
• The Memory Dump command only supports hexadecimal syntax. If dumps in floating point
format are required, use the plot memory command, and examine the resulting .plt file.
• In the DAGs, when setting the base registers via the user interface, the corresponding index
register does not get set. This operation differs from the simulator where the corresponding
index register’s value is set when the base register is set.
• Software breakpoints can not be placed anywhere that a CALL instruction would be invalid
(i.e., either of the two instructions following a Delayed Branch jump or call, or last three
instructions of a DO loop).
• When loading an executable with Memory Verify ON, any ports that are defined may
report a readback error. Ignore this or turn Memory Verify OFF.
• PM locations, 0 through 7 for ADSP-21020 and 0x20000 through 0x20003 for ADSP-2106x,
must not contain any user code and should be set to 0 (NOP).
• Interrupts to the 21020 are not latched when the emulator is halted.
• Single-step does not wait on IDLE instructions (i.e., executes as a NOP).
• Three locations of the PC stack are reserved for the emulator.
• IRPTL and IMASKP have bit 0 set while stepping.
• Timer continues to operate while software breakpoints are encountered (for approximately 4
cycles with zero wait states).
• If the timer interrupt gets serviced while a software breakpoint is executed, the status stack
may not get cleaned up properly; there may be an extra MODE1/ASTAT pair on the stack.
• PMDA instructions perform both program memory accesses when single-stepping (for
example, if the cache contains the instruction to be fetched, it may fetch the instruction
anyway).
• Software breakpoints remain in memory if you select the RUN command then EXIT or QUIT.
This results in the opcode at the breakpoint address being lost (breakpoint address will
contain a CALL).
7/14/95 page 18
ADSP-21000 Family Development Software Release 3.2
• When entering instructions into a PM window, “if ... else” instructions must be entered with a
comma (,) before the else.
• Do not set the IMASKP bit 0 (EMUI). This causes the processor to enter emulator space.
• The [user_sect] “base_address” variable in the wice0x0.ini file must be set correctly
according to the actual jumper configuration on the emulator PC plug-in board. The default
address for the board is 0x340. See the EZ-ICE Emulator Manual for complete hardware
installation details.
• FADDR and DADDR always display the current PC value in the Program Counters window.
• After a chip reset, all simulators, include the 2 clock cycles that are used to fill the
fetch/decode/execute pipeline, in their cycle counters. The EZ-ICEs do not count these 2
cycles.
7/14/95 page 19
ADSP-21000 Family Development Software Release 3.2
• If, upon startup, the architecture file listed in the initialization file contains a “.processor”
statement that conflicts with the target processor, the error message displayed does not
include the filename of the architecture file. For example, the error message displayed is
“Invalid .PROCESSOR directive in .”
rather than
“Invalid .PROCESSOR directive in c:\21k.ach.”
• When using the “command_timeout” feature to automatically stop the emulator after ‘n’
milliseconds, there is a slight chance that the emulator will halt and then automatically restart.
This can happen if the emulator encounters a breakpoint at the same instant as the ‘n’
milliseconds timer expires.
• When using the EZ-ICE with the EZ-LAB board and DspHost software library, resetting the
2106x, while the library is in the middle of a PC to EZ-LAB data transfer may cause the PC to
hang.
• The EZ-ICE does not save and restore the instruction cache contents during emulator
halts/runs and single steps. As a result, when benchmarking code, run the code under test
from start to finish with no breakpoints, halts or single steps.
5.8 Splitter
• None.
7/14/95 page 20
ADSP-21000 Family Development Software Release 3.2
5.10.1 CSWAP
• None.
5.10.2 CDUMP
• None.
7/14/95 page 21