Digital Signal Processing (DSP) Processors
Introduction
• The computing power of a chip depends on the CPU core it contains.
• MCUs with an 8051 core or an ARM core
– General-purpose cores
– Good at general-purpose arithmetic and data processing
• Dedicated hardware is required when
– Specialized arithmetic operations are needed (fast multiply-and-add, floating-point arithmetic, exponentiation, etc.)
– The application demands fast results for complex math computations
DSP Processor
• Dedicated hardware with programmable features is called a DSP processor, because the applications that need such complex math are typically DSP computations.
Application Scenario
• Many embedded applications need DSP computations.
• E.g. a mobile phone deals with compressed audio, video and still images, so computation in real time is mandatory.
• A general-purpose processor may not be able to perform DSP computations in real time, and therefore a DSP core is needed.
Application Scenario
• Such systems use a powerful MCU for general-purpose computations and an additional DSP core.
• The two cores may be present as two separate chips, or a single dual-core processor may be used.
• Such dual-core devices are becoming popular; an ARM core plus a DSP core is a common combination.
List of Applications

– Communications
– Audio and video processing
– Graphics, image enhancement
– Navigation, radar, GPS
– Robotics
Embedded System & DSP Processor
• Simple embedded systems such as printers, motor controllers, washing machines and dishwashers need only an MCU
– No need for a DSP processor
• Complex applications such as ABS (Anti-lock Braking System)
– A DSP processor is mandatory
DSP processor
• As a single unit, a DSP processor has a computational core that is similar in many ways to that of a general-purpose processor (GPP), but with differences or enhancements that cater to the special requirements of DSP computations.
• Most DSP cores have peripherals just like an MCU: I/O capability with serial ports, SPI, I2C and parallel ports, timers, DMA, ADC, DAC, and special peripherals for specific applications, such as dedicated peripherals for audio and video.
Manufacturers of DSP
• Two leading designers of DSP processors are Texas Instruments (TI) and Analog Devices.
• Besides these two, Lucent and Motorola also have a share of this market.
General features of DSP
• Features are need-driven
• They come directly from signal processing algorithms
• These processors should perform the computations involved in such algorithms much faster and more efficiently than a GPP can.
General features of DSP
• E.g. a very common and basic DSP algorithm is the FIR filter.
• The computation involved is:

$$y(n) = \sum_{i=0}^{N-1} a_i \, x(n-i)$$

• where y(n) is the current output,
• x(n-i) are the delayed values of the input,
• the a_i are the filter coefficients, and N is the number of coefficients.
• Each z^-1 block causes one unit of delay.
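The FIR computation, in which each output y(n) is the sum of the products a_i · x(n-i), can be sketched in a few lines of Python. This is a minimal illustration only: the function name `fir_filter` is hypothetical, and samples before the start of the input are assumed to be zero.

```python
def fir_filter(coeffs, samples):
    """Direct-form FIR: y(n) = sum over i of a_i * x(n - i).

    Assumes x(k) = 0 for k < 0 (no history before the first sample).
    """
    outputs = []
    for n in range(len(samples)):
        acc = 0  # accumulator, cleared for every output sample
        for i, a in enumerate(coeffs):
            if n - i >= 0:
                acc += a * samples[n - i]  # one multiply-accumulate step
        outputs.append(acc)
    return outputs


# Three coefficients of 1 (a running sum) applied to a short ramp
y = fir_filter([1, 1, 1], [1, 2, 3, 4])
print(y)  # [1, 3, 6, 9]
```

The inner loop performs one multiply and one add per coefficient for every incoming sample, which is the workload the features on the following slides are designed to speed up.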
General features of DSP
1. Fast MAC Units
– Multiply and Accumulate
Example: the operations involved in FIR filtering
• Signal samples arrive continuously, and as they do, they and their delayed versions are multiplied by the filter coefficients and the products are added.
• The multiply-and-accumulate operation is therefore essential for an FIR filter.
1. Fast MAC Units
• Characteristics of DSP processing that do not apply to a GPP:
• The data to be processed is an unbounded stream that must be handled in real time
• The work is mostly intense arithmetic processing, with very little branching and control
1. Fast MAC Units
• These demands make hardware-based 'multiply and accumulate' (MAC) units a foremost requirement.
• Since MAC is the basic filtering operation, multiplication and accumulation must be fast.
• This calls for specialized, high-speed multiplication algorithms and dedicated hardware for them.
• There is usually a 'single-cycle MAC', in which a complete multiply and accumulate finishes within one cycle.
1. Fast MAC Units
• All DSP processors provide the MAC as a single unit of computation
2. Specialized Instructions
– Rather than writing a program to perform an operation, it is convenient to have a single instruction for commonly used operations
– Specialized DSP processors, such as those that cater to a specific application like video, audio or networking, should have specialized instructions for that application scenario.
3. Efficient Memory Accessing
DSP processors, like many MCUs, use the Harvard architecture (program and data memory spaces are disjoint)
DSP algorithms have some special features
• These can be exploited to get still better performance
• Such algorithms generally execute the same set of instructions repeatedly
• There is no point in fetching these instructions from program memory again and again
3. Efficient Memory Accessing
- Use an instruction cache in the CPU
- It stores the most recent instructions
- Instructions need to be fetched from program memory only once. For the next round, instructions can be taken from the instruction cache.
- The figure shows a Harvard architecture, with separate address and data buses connecting to separate data and program memories.
Efficient Memory Accessing
• The program memory bus is free once the instructions have been copied to the instruction cache
• In a filtering operation, the signal values come from an external source and may differ from sample to sample.
• The filter coefficients do not change; they are stored in program memory and accessed in every cycle
Efficient Memory Accessing
• For one multiplication, data memory and program memory are accessed concurrently. We get
– The signal sample from data memory
– The coefficient from program memory
– The instruction from the instruction cache
4. DMA for Input Data
- Data memory stores the input data (such as signal values)
- Input data is usually transferred to data memory through a DMA mechanism, with no need to pass through CPU registers
5. Circular Buffers
– Useful when the same data (coefficients or delayed signal values) is involved in the computation repeatedly
– Allow the processor to access data sequentially and wrap around to the beginning address
5. Circular Buffers
• In a filter, each time a new sample appears, the oldest (nth) sample is discarded and the other n-1 samples are reused. The method is then simply to change the pointer to the latest sample in the circular list.
Parameters needed for such a circular buffer:
– Pointer to the start of the circular buffer in memory
– Pointer to the end of the buffer
– Pointer to the most recent sample, which is modified as each new sample is acquired.
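The pointer handling of a circular buffer can be illustrated in software as follows. This is only a sketch: the class name `CircularBuffer` and its methods are hypothetical, and on a DSP processor the wrap-around is done by the addressing hardware rather than by an explicit modulo operation.

```python
class CircularBuffer:
    """Fixed-length delay line; `newest` plays the role of the moving pointer."""

    def __init__(self, length):
        self.data = [0] * length  # storage from start to end of the buffer
        self.newest = 0           # index of the most recent sample

    def push(self, sample):
        # Advance the pointer with wrap-around; the oldest sample is overwritten.
        self.newest = (self.newest + 1) % len(self.data)
        self.data[self.newest] = sample

    def delayed(self, i):
        # Return x(n - i), where i = 0 is the most recent sample.
        return self.data[(self.newest - i) % len(self.data)]


buf = CircularBuffer(3)
for sample in [1, 2, 3, 4]:
    buf.push(sample)

print([buf.delayed(i) for i in range(3)])  # [4, 3, 2]
```

No samples are ever moved in memory; only the index of the newest sample changes, which is why the scheme is cheap.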
6. Zero-overhead Looping
– Looping is implemented in hardware
– No extra cycles are spent on testing and updating the loop counter
7. Multiple Execution Units
– The architecture is enhanced by increasing the number of execution units
– More MAC units, multipliers, etc.
8. Address Generation Units
– Memory addresses are generally predictable in
DSP algorithms
– Addressing modes like auto-increment, modulo (circular) and bit-reversed are useful
– Such special requirements of signal processing algorithms warrant the inclusion of dedicated address generation/calculation units, and most DSP processors have them.
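As an illustration of the bit-reversed mode, the sketch below computes in software the address sequence that an address generation unit would produce in hardware. The function name `bit_reverse` is hypothetical. Bit-reversed ordering is the access pattern needed by FFT algorithms.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the lowest bit into result
        index >>= 1
    return result


# Access order for an 8-entry table (3 address bits)
order = [bit_reverse(i, 3) for i in range(8)]
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Doing this per access in software costs several instructions, which is the motivation for a dedicated unit that delivers the address in the same cycle as the computation.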
Address Generation Units
• The figure shows a dedicated address calculation unit that supports modulo and bit-reversed arithmetic.
• In many processors such units are duplicated, because multiple addresses are needed in each cycle.
9. Data Format
– Fixed-point representation, e.g. Analog Devices' Blackfin
– Floating-point representation, e.g. the SHARC family
Fixed Point representation
• The fixed-point representation usually chosen is two's complement form, which can represent both positive and negative numbers; these may be integers or fractions.
• "Fixed point" refers to the manner in which numbers are represented: with a fixed number of digits before or after the decimal point.
• DSP processors use either 16 or 32 bits for the fixed-point format; one bit is allotted to the sign.
Fixed Point Representation
• There is a scaling factor involved. The position of the decimal point is fixed by virtue of the scaling factor,
e.g. the number 4.567 can be stored as the integer 4567 with a scaling factor of 1/1000.
• To perform integer computations, all the numbers involved must have the same scaling factor.
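The scaling idea can be tried out with ordinary integers. The sketch below assumes a decimal scaling factor of 1/1000, as in the 4.567 example; real DSP hardware normally uses power-of-two scaling, and the helper names `to_fixed` and `to_real` are hypothetical.

```python
SCALE = 1000  # assumed scaling factor 1/1000: stored integer = real value * 1000


def to_fixed(value):
    """Convert a real number to its scaled-integer form."""
    return round(value * SCALE)


def to_real(fixed):
    """Convert a scaled integer back to a real number."""
    return fixed / SCALE


a = to_fixed(4.567)  # stored as 4567
b = to_fixed(1.2)    # stored as 1200

total = a + b               # both operands share the scale, so add directly
product = (a * b) // SCALE  # a*b carries scale 1/1000000; rescale to 1/1000

print(total, to_real(total))      # 5767 5.767
print(product, to_real(product))  # 5480 5.48
```

The product line shows why the scaling factor has to be tracked by the programmer: multiplying two scaled integers changes the scale, and the rescaling step discards the lowest digit (the exact product is 5.4804).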
Fixed Point Representation
1. If a fixed-point number becomes too large for the available word length, the programmer has to scale the number down. In this process the lower-order bits are dropped.
E.g. with a data width of 6 digits, 45678.1234 is represented as 45678.1
2. If a fixed-point number is small, only a few of the available bits are actually used to represent it. The programmer may decide to scale the number up.
Floating Point Representation
• Standardized as the IEEE 754 format
• A number in this format consists of a sign, an exponent and a mantissa
• 32-bit single precision format (1 bit for sign, 8 bits for exponent, 23 bits for magnitude of mantissa)
• 64-bit double precision format (1 bit for sign, 11 bits for exponent, 52 bits for magnitude of mantissa)
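The 1/11/52 split of the double precision format can be inspected directly. The sketch below uses Python's standard `struct` module to reinterpret a double as a 64-bit integer; the function name `double_fields` is hypothetical.

```python
import struct


def double_fields(value):
    """Split a double into its (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, stored with a bias of 1023
    mantissa = bits & ((1 << 52) - 1)  # 52 bits, leading 1 is implicit
    return sign, exponent, mantissa


print(double_fields(1.0))   # (0, 1023, 0)
print(double_fields(-2.5))  # (1, 1024, 1125899906842624), i.e. mantissa = 2**50
```

For -2.5 = -1.25 × 2^1, the sign bit is 1, the stored exponent is 1 + 1023 = 1024, and the fraction 0.25 sets the second-highest mantissa bit.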
Comparing two formats
i. Dynamic range: floating-point representation has a very large dynamic range
ii. Precision: floating point yields higher precision than fixed-point processing
Fixed point can also achieve low quantization noise; for this, rounding and truncation rules must be specified and used as part of the programming
Size of intermediate registers
• DSP involves many MAC computations
• It is not good practice to round/truncate the product to the word length of the processor after each multiplication or addition
• Intermediate results must be stored in larger registers, and only the final result is made to fit into the processor register
Size of intermediate registers
• E.g.
1. For the 16-bit fixed-point format
– Product is 32 bits
– Size of intermediate register: 40 bits
2. For the 32-bit floating-point format, multiplication of two 24-bit mantissas
– Product is 48 bits
– Size of intermediate register: 80 bits for the mantissa alone
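The effect of truncating after every multiplication, compared with keeping a wide intermediate result, can be seen in a small experiment. This is a sketch under stated assumptions: samples are 16-bit fractions scaled by 2^15 (a common fixed-point convention, not stated in the slides), Python's unbounded integers stand in for the wide accumulator, and the helper names are hypothetical.

```python
def q15(value):
    """Scale a fraction in [-1, 1) to a 16-bit integer (assumed 2**15 scaling)."""
    return int(round(value * 32768))


def mac_wide(xs, ys):
    """Accumulate full-width products; truncate once at the end."""
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y  # 32-bit product kept at full width
    return acc >> 15  # single truncation back to 16-bit scaling


def mac_narrow(xs, ys):
    """Truncate every product to 16-bit scaling before accumulating."""
    acc = 0
    for x, y in zip(xs, ys):
        acc += (x * y) >> 15  # low bits are lost on every step
    return acc


xs = [q15(0.001)] * 100
ys = [q15(0.001)] * 100

print(mac_wide(xs, ys))    # 3
print(mac_narrow(xs, ys))  # 0
```

Each individual product is too small to survive truncation, so the narrow version loses the entire result, while the wide accumulator preserves the sum of the small contributions.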
Choosing between two formats
• Floating point is used only when computational accuracy is the most important criterion,
e.g.
1. For still images and video: fixed-point processor
2. For audio: floating-point processor
List of applications for two formats
Floating point Fixed Point
• RADAR and military • Consumer Electronics
• Medical • Hand held Devices
• Communications • Image and video
• Robotics
• Industry Control
10. Two Level Cache
– Older DSP processors did not use caches
– They used multiple banks of on-chip memory
– They used multiple bus sets to enable several memory accesses per instruction cycle
– Newer DSP processors use a two-level cache
– Level 1 (for code and data): very fast and close to the core
– Level 2: slower, but still better performance than off-chip memory would provide
11. Programming Language
• Like any other processor, a DSP processor can be programmed in a high-level language or in assembly language.
• Assembly language is efficient but cumbersome
• High-level languages are simpler but less efficient
• Very efficient high-level language compilers are now available, and C is the popular programming language
• So now around 90% of programming is done in HLLs and the rest in assembly.
12. Real Time Operating Systems
– DSP applications need real-time output, so meeting time constraints is an issue
13. Power Dissipation
– Low power dissipation is the biggest and most emphasized figure of merit
– Both the voltage and the frequency of operation are varied to lower power consumption
– This translates directly into longer battery life for portable appliances.
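Why varying both voltage and frequency is effective can be seen from the usual first-order model of dynamic power in CMOS logic. The model is a standard approximation, not something specific to a particular DSP; here α is the switching activity, C the switched capacitance, V the supply voltage and f the clock frequency. The 0.8 scaling values are chosen purely for illustration.

```latex
P_{\text{dyn}} \approx \alpha \, C \, V^{2} f
\qquad\Rightarrow\qquad
\frac{P_{\text{dyn}}(0.8\,V,\ 0.8\,f)}{P_{\text{dyn}}(V,\ f)} = 0.8^{2} \times 0.8 = 0.512
```

Under this model, lowering voltage and frequency together by 20% roughly halves the dynamic power, whereas lowering the frequency alone would save only 20%.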
14. Streamlined I/O and Specialized
DSP Peripherals
– Use high-speed parallel and serial I/O with DMA
– Low-latency interrupts and interrupt controllers
– Serial ports like I2C and SPI are available in DSPs, which allow fast access to SD cards and similar storage media
– DSP processors have specialized peripherals that cater to specific applications
– A DSP recommended for video should have specialized peripherals and connectors for it.
15. Parallelism in the Processor
Architecture
– Methods
• Increase the number of operations that can be performed in each instruction (complex operations).
• Increase the number of instructions that can be issued and executed in every cycle.
– Superscalar architecture, VLIW
The Superscalar Architecture
– Used in high-end general-purpose processors (GPPs), e.g. Pentium
– They can be 2-way or 4-way superscalar, i.e. the number of instructions per cycle
– Use multiple execution units
– Execution units may be of different capacity
The VLIW Architecture
– Very Long Instruction Word
– Utilizes instruction-level parallelism
– Used in DSP processors
– Many instructions are bundled together as one word
• All these instructions are executed simultaneously
– Also uses multiple functional units
– All execution units are of the same power.
Parallelism in the Processor
Architecture
The VLIW Architecture (Characteristics)
– Each set consists of simple instructions.
– Instruction scheduling is done at compile time, i.e. behavior is deterministic.
– DSP performance is easily adapted to a wide range of algorithms.
– The architecture is scalable (more execution units)
– Example:
a = b + c;  x = y + 1
d = a + e;  x = x + 1
The VLIW Architecture
Example: TI's TMS320C62xx
• This processor fetches up to 256 bits of instruction words at a time, breaks them into as many as eight 32-bit instructions and passes them to eight independent computational units
• Sometimes all eight units are active simultaneously, but in some cases this is not possible.
• For VLIW to be effective there must be sufficient
parallelism in the code to occupy the many
execution units.
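How the available parallelism limits the use of the execution units can be illustrated with a toy compile-time scheduler. This is a sketch only: the function `bundle`, the (destination, sources) instruction format and the conservative dependency rule are assumptions made for illustration, and the four assignments are read as one sequential program, which is one possible interpretation of the example on the characteristics slide.

```python
def bundle(instructions, width):
    """Greedy in-order packing of instructions into instruction words.

    Each instruction is (dest, sources). An instruction starts a new word
    if the current word is full or if it conflicts with a register that
    an instruction already in the word reads or writes.
    """
    words = [[]]
    for dest, sources in instructions:
        current = words[-1]
        written = {d for d, _ in current}
        read = {s for _, srcs in current for s in srcs}
        conflict = bool(written & set(sources)) or dest in written or dest in read
        if conflict or len(current) == width:
            words.append([])
        words[-1].append((dest, sources))
    return words


program = [
    ("a", ("b", "c")),  # a = b + c
    ("x", ("y",)),      # x = y + 1
    ("d", ("a", "e")),  # d = a + e  (needs a)
    ("x", ("x",)),      # x = x + 1  (needs x)
]

for width in (1, 2, 8):
    print(width, len(bundle(program, width)))
# 1 4
# 2 2
# 8 2
```

Going from one to two execution units halves the number of instruction words, but eight units give no further gain, because the third and fourth instructions depend on the results of the first two.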
The VLIW Architecture
• Examples of modern VLIW processors from DSP vendors:
• TI's TMS320C6xxx series
• Analog Devices' TigerSHARC
• StarCore, a joint venture of Lucent and Motorola
• VLIW and superscalar processors often suffer from higher energy consumption than conventional DSPs, because speed is the main emphasis in such designs
SIMD Techniques
– Single Instruction Multiple Data
– Data Parallelism
– Multiple data items are operated on by the same instruction
– Added as an additional capability in DSP processors; in some processors, like SHARC, the SIMD section can be switched ON or OFF
SIMD Techniques
Example
– Each item of monochrome image data is one byte long
– A 64-bit register can therefore operate on 8 data items at once
– It can also be used as four 16-bit operands, two 32-bit operands or one 64-bit operand
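The idea of treating one 64-bit register as eight independent bytes can be imitated in software. The sketch below is an illustration rather than how any particular DSP implements SIMD: it packs eight pixel values into a single integer and adds two such words lane by lane, with masking used to stop carries crossing from one byte into the next. All function names are hypothetical.

```python
LOW7 = 0x7F7F7F7F7F7F7F7F  # lower 7 bits of every byte lane
HIGH = 0x8080808080808080  # top bit of every byte lane


def pack_bytes(values):
    """Pack eight 8-bit values into one 64-bit word (lane 0 = lowest byte)."""
    word = 0
    for lane, v in enumerate(values):
        word |= (v & 0xFF) << (8 * lane)
    return word


def unpack_bytes(word):
    """Split a 64-bit word back into its eight 8-bit lanes."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(8)]


def packed_add_u8(a, b):
    """Add eight byte lanes at once, each modulo 256, with no carry between lanes."""
    return ((a & LOW7) + (b & LOW7)) ^ ((a ^ b) & HIGH)


pixels = pack_bytes([10, 20, 30, 40, 50, 60, 70, 250])
offset = pack_bytes([10] * 8)

print(unpack_bytes(packed_add_u8(pixels, offset)))
# [20, 30, 40, 50, 60, 70, 80, 4]
```

One add operation updates all eight pixels; the last lane wraps around (250 + 10 = 260, stored as 4) without disturbing its neighbours.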
SIMD Techniques
• Many DSP and multimedia applications use vectors of packed 8-, 16- and 32-bit integers, and also floating-point numbers, which lets them benefit from a SIMD architecture
• Analog Devices' TigerSHARC is a DSP processor that uses a VLIW architecture with SIMD capabilities
• The figure shows the six execution units of this processor, grouped into two sets of three.