DSP_presentation_Sumit 1

The document provides an overview of Digital Signal Processing (DSP) systems, detailing their architecture, application areas, and the specific requirements for DSP processors compared to general-purpose processors. It discusses various implementation approaches, including general-purpose computers, custom VLSI components, and dedicated DSP processors, highlighting their characteristics and limitations. Additionally, it covers the mathematical operations central to DSP, such as Multiply and Accumulate (MAC), and the importance of real-time processing in applications like medical imaging, radar, and speech processing.

Uploaded by

SUMIT DATTA
A Digital Signal Processing System

[Block diagram: Analog signal in → Antialiasing filter → Sample and hold → A/D → DSP → D/A → Reconstruction filter → Analog signal out]
A perspective of the Digital Signal Processing problem

[Figure: layered view of the DSP problem, from top to bottom]
• Application areas: Medical, Radar, Speech, Seismic, Image, ...
• Digital signal processing theory
• Theoretical problem modelling; Basic functions
• Algorithms
• Architectures
• Implementation: processor instruction sets and/or hardware functions
• Component technology
DSP APPLICATION CHARACTERISTICS

APPLICATION REQUIREMENT      PROCESSOR ATTRIBUTES
REAL-TIME PROCESSING         HIGH SPEED, HIGH THROUGHPUT
LARGE ARRAY OF DATA          INSTRUCTIONS TO MOVE AND PROCESS LARGE DATA ARRAYS
ALGORITHM INTENSIVE          FAST MATHEMATICAL COMPUTATIONS, SINGLE CYCLE DSP OPERATIONS (MACD)
SYSTEM FLEXIBILITY           GENERAL PURPOSE PROGRAMMABILITY, EPROM, MC/MP MODE
Different approaches to hardware implementation

1. HIGH SPEED GENERAL PURPOSE COMPUTERS
   Advantages: Programmable; can be configured for different applications
   Limitations: Expensive; complex control; I/O overheads

2. CUSTOM-DESIGNED VLSI COMPONENTS
   Advantages: Efficient design; large throughputs
   Limitations: Application specific; high development cost

3. GENERAL PURPOSE DIGITAL SIGNAL PROCESSORS
   Combine the programmability and control features of general purpose computers with the architectural innovations of special purpose chips.

GOALS: HIGH SPEED, LOW POWER AND LOW COST


General purpose computers

1. Flexible
2. Suitable for Internet and
Multimedia application
3. Software Intensive
4. Slow for high speed application
5. Too bulky
6. Power hungry
Why are conventional Processors not suitable for DSP?

• Caches are a waste of chip area
• Small register files force lots of memory accesses
  - these are different from cache since these are program managed
• Complex instruction issue logic, branch prediction, speculation etc. are not needed for DSP
• Not enough ALU function


Data Processing vs Signal Processing
• General-purpose microprocessors are designed primarily for Data Processing.
  – The primary burden is Data Read/Write
• Digital Signal Processors are microprocessors specifically designed for Signal Processing.
  – The primary burden is Mathematical operation
• DSP architecture therefore incorporates certain features not found in general-purpose µPs.
DSP Requirements
• Emphasis is on mathematical operations rather than data manipulation operations like word processing, database management etc.
  – Design is optimized for DSP algorithms such as FIR filters and FFTs.
• Processing is real-time, i.e. the input signal arrives continuously, and the output signal is produced continuously as the input is acquired.
• Dominant mathematical operation is Multiply and Accumulate (MAC), on separate inputs in parallel.
Digital Signal Processor features

• Caters to high arithmetic demands
• Real time operation
• Analog input / output
• Large number of functional units for a given size
• Small control logic


Typical MAC Execution Cycle
• Obtain a sample of the Input signal
• Move the input sample into the input buffer
• Fetch the co-efficient from internal buffer
• Multiply the input sample by the co-efficient
• Add the product to the Accumulator
• Move the output to the output buffer
• Send it out as a sample of the output signal
MAC Execution Hardware

[Block diagram labels: Program Counter, Program/Coefficient Memory, Data Read Address, Data Write Address, Data Memory, multiplier (*), CP, ACC]
Architecture of Digital Signal Processors
• General-purpose processors are based on the Von Neumann architecture (a single memory bank, which the processor accesses through a single set of address and data lines)
• Harvard architecture is commonly used in DSP processors
  – Separate Data and Program memories (two memory banks)
  – Separate Address Buses for Data and Program memories
Additional Features in a DSP Processor
• Instruction Cache and Pipelined processor as in any modern microprocessor, but no Data Cache
• Separate ALU, Multiplier and Shifter, connected through multiple internal data buses, enabling fast MAC operations
DSP, CISC and RISC
• DSP processors cannot truly be called either CISC or RISC processors
• Some features present in a RISC processor may exist. However, DSP processors are “tuned” towards operations encountered in signal processing applications
DSP IMPLEMENTATION APPROACHES

Important desirable characteristics

• Adequate word length
• Fast multiply & accumulate
• High speed RAM
• Fast coefficient table addressing
• Fast new sample fetch mechanism
DSP functions implemented with IC chips

Issues:
Speed       Architectural features
Accuracy    Register lengths and floating point capability
Cost        Advances in VLSI techniques
GENERAL PURPOSE DSP FEATURES
1. PARALLELISM: Multiple Functional Units
Multiple Buses
Multiple Memories

2. PIPELINING
3. HARDWARE MULTIPLIERS AND OTHER ARITHMETIC FUNCTIONS
4. ON-CHIP AND CACHE MEMORIES
5. A VARIETY OF ADDRESSING MODES
7. INSTRUCTIONS THAT PACK SEVERAL OPERATIONS
8. ZERO-OVERHEAD LOOPING
9. I/O FEATURES SUCH AS INTERRUPT, SERIAL I/O, DMA
10. OTHER CONTROL FUNCTIONS SUCH AS WAIT STATES
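A second order FIR filter

[Figure: tapped delay line. x(n) passes through two delays to give x(n-1) and x(n-2); the taps are weighted by h(0), h(1), h(2) and summed to give y(n)]

y(n) = h(0)x(n) + h(1)x(n-1) + h(2)x(n-2)

[Figure: memory layout. Data memory holds x(n), x(n-1), x(n-2), with ar1 pointing at x(n-2); coefficient memory holds h(0), h(1), h(2), with ar2 pointing at h(2); the MAC unit produces y(n). The next sample x(n+1) enters at the top of the delay line]
Organization of signal samples and filter coefficients for a second order FIR filter implementation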
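An Nth order FIR filter implementation

[Figure: Coefficient Memory holds A[0], A[1], A[2], ..., A[N-1]; Data Memory holds X[n], X[n-1], X[n-2], ..., X[n-N+1]. One value from each memory feeds the multiplier (*); the product register P feeds the adder (+), which accumulates into ACC to give y[n]]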
FIR Filter pseudo-code
      Load loop count
      Initialize coefficient and data addr regs
      Zero Acc and P registers
LOOP: Pnew = A[i] . X[n-i]
      Accnew = Accold + Pold
      Decrement coefficient and data addr regs
      X[n-i] → X[n-i-1] {for next iteration}
      Decr loop count
      BNZ LOOP
      Acc = Acc + P {add the final product}
      Acc → Y[n]
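The loop above can be mirrored in a few lines of Python. This is only a sketch under stated assumptions: `fir_output`, `coeffs` and `delay_line` are illustrative names that do not come from the slides, and the delay line is taken to hold X[n-i] at index i.

```python
def fir_output(coeffs, delay_line, new_sample):
    """Compute one FIR output y[n] and age the delay line by one sample.

    coeffs[i] plays the role of A[i]; delay_line[i] holds X[n-i].
    All names here are hypothetical, chosen for illustration only.
    """
    n_taps = len(coeffs)
    delay_line[0] = new_sample            # newest sample becomes X[n]
    acc = 0                               # zero ACC
    product = 0                           # zero P register
    # Walk from the oldest sample to the newest so that the data move
    # X[n-i] -> X[n-i-1] never overwrites a sample that is still needed.
    for i in range(n_taps - 1, -1, -1):
        acc += product                    # ACC_new = ACC_old + P_old
        product = coeffs[i] * delay_line[i]    # P_new = A[i] * X[n-i]
        if i + 1 < n_taps:
            delay_line[i + 1] = delay_line[i]  # shift sample for the next output
    acc += product                        # add the final pending product
    return acc
```

Feeding a unit impulse through `fir_output` with coefficients `[1, 2, 3]` should return the coefficients themselves, one per call, which is a quick way to check the data move.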
A Typical DSP Architecture

[Block diagram labels: PM Address Generator and DM Address Generator driving the PM Address and DM Address buses; Program Memory (PM) holding instructions and secondary data; Data Memory (DM) holding data only; Program Sequencer; Instruction Cache; PM Data and DM Data buses; Registers; Multiplier; ALU; Shifter; I/O Controller (DMA) on the DMA Bus; Input/Output]
Salient Features
• REPEAT-MAC instruction
  - Performs auto-increment of both coefficient and data pointers
  - Frees up program memory bus for fetching coefficients
• Circular buffer
  - to manage data movement at the end of every output computation
• Handling precision
  - Accumulator guard bits
  - Saturation mode
  - Shifters (both right and left shift)
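A circular buffer avoids physically moving every sample after each output: only a write index advances, and addresses wrap modulo the buffer length. The sketch below is an assumption-laden illustration (the class name `CircularFir` and its fields are hypothetical); a DSP would do the wrap in address-generation hardware rather than with `%`.

```python
class CircularFir:
    """FIR filter whose delay line is a circular buffer (illustrative only)."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)           # h(0), h(1), ...
        self.buf = [0] * len(self.coeffs)    # delay line, initially cleared
        self.head = 0                        # slot that receives the next sample

    def step(self, sample):
        n = len(self.buf)
        self.buf[self.head] = sample         # overwrite the oldest sample
        acc = 0
        idx = self.head                      # idx points at x(n)
        for c in self.coeffs:
            acc += c * self.buf[idx]         # multiply and accumulate
            idx = (idx - 1) % n              # modulo wrap = circular addressing
        self.head = (self.head + 1) % n      # advance the write pointer
        return acc
```

Compared with moving data on every tap, this variant touches each stored sample once per output and writes only one new value.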
Types of multipliers used

• Array multipliers
• Multipliers based on modified Booth’s algorithm

[Figure: Product Computation Unit of a simple multiplier for 4-bit unsigned numbers X and Y]
[Figure: Multiplying X and Y: summation unit of the simple multiplier]
[Figure: Combinational array for Booth’s algorithm – Basic cell B]
[Figure: Array Multiplier for 4x4-bit numbers using basic cell B]
Arithmetic: Fixed point Vs Floating point

Fixed point
• Array indices, loop counters etc.
• Overflow/underflow management
• Error budget for word length growth

Floating point
• Wider dynamic range
• Frees user from scaling concerns
• Less sensitive to error accumulation
• 50% slower for same technology
• Higher cost
• Normalize after each operation
• Mantissa round off (some accuracy is traded)
Fixed point does not always limit performance:

e.g., for a dynamic range of 50 to 60 dB, 12-bit quantization (step size of -72 dB) is more than adequate. For Hi-fi audio with 80 dB dynamic range, 16 bits (-96 dB) are more than adequate.
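The -72 dB and -96 dB figures follow from the usual rule of roughly 6 dB per bit, since 20·log10(2) ≈ 6.02. A minimal check, with `quantization_range_db` as a hypothetical helper name:

```python
import math

def quantization_range_db(bits):
    """Ratio of full scale to one quantization step, in dB, for a given word length."""
    return 20 * math.log10(2 ** bits)

# About 72 dB for 12 bits and about 96 dB for 16 bits.
range_12 = quantization_range_db(12)
range_16 = quantization_range_db(16)
```

Both values exceed the 50 to 60 dB and 80 dB requirements quoted above, which is the sense in which the word lengths are "more than adequate".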
Overflow Management

SHIFT
  Left shift removes the redundant sign bit after 2’s complement multiplication.
  Right shift scales numbers down as word growth is detected.

Unbiased rounding
  Prevents accumulation of a small dc bias from outputs which fall exactly half way between adjacent rounded values.

Saturation Logic
  Sets the register contents to the maximum representable value if overflow occurs.

Block Floating Point
  Scaling logic + exponent register: if an overflow condition is detected at any point, the entire array is rescaled downwards and the scaling is stored in the block exponent register.
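Saturation and block floating point can be sketched for 16-bit signed words as follows. The function names and the 16-bit word length are assumptions made for illustration; real hardware applies these steps inside the ALU and scaling logic.

```python
INT16_MAX = 2 ** 15 - 1
INT16_MIN = -(2 ** 15)

def saturate16(value):
    """Clamp a result to the 16-bit signed range instead of letting it wrap."""
    return max(INT16_MIN, min(INT16_MAX, value))

def block_rescale(block, exponent=0):
    """Scale a whole block down until every value fits in 16 bits.

    Returns the rescaled block and the updated block exponent, so that
    each original value is approximately stored_value * 2**exponent.
    """
    while any(v > INT16_MAX or v < INT16_MIN for v in block):
        block = [v >> 1 for v in block]   # arithmetic right shift of every element
        exponent += 1
    return block, exponent
```

For example, `block_rescale([40000, -100, 8])` halves every element once, because only the first value overflows, and records that in the exponent.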
SHIFTERS
- Scale numbers to prevent overflow/underflow
- Conversion between fixed point and floating point
- Many bits must be shifted in a single cycle to preserve single cycle computational speed (Barrel Shifter)
- Logical shift assumes unsigned data and fills with zeroes, left or right
- Arithmetic shift scales numbers upwards (left, zero fills) or downward (right, sign extends)
- Normalization/de-normalization for block floating point
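The difference between logical and arithmetic shifts shows up only when the sign bit is set. The sketch below models 16-bit words held as unsigned bit patterns; the word length and function names are assumptions for illustration.

```python
WORD_BITS = 16
MASK = (1 << WORD_BITS) - 1

def logical_shift_right(word, count):
    """Treat the word as unsigned: vacated high bits are filled with zeroes."""
    return (word & MASK) >> count

def arithmetic_shift_right(word, count):
    """Treat the word as two's complement: vacated high bits copy the sign bit."""
    word &= MASK
    if word & (1 << (WORD_BITS - 1)):     # sign bit set, so the value is negative
        word -= 1 << WORD_BITS            # reinterpret the pattern as a negative integer
    return (word >> count) & MASK         # Python's >> sign-extends negative integers

def shift_left(word, count):
    """Left shift fills with zeroes; bits shifted past the word length are dropped."""
    return (word << count) & MASK
```

A barrel shifter performs any such `count` in a single cycle, whereas a simple shifter would need `count` one-bit steps.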
Memory
Traditional µPs : register to register (limited memory bandwidth)
DSPs : memory to memory (higher memory bandwidth)
  up to six memory fetches in an inst. cycle
Parallel memory banks: small, fast and simple memories.

Internal Vs External
Off-chip bussing brings a pin-count limitation and a speed penalty.

Internal busses are multiplexed to the


MEMORY ORGANISATION - I

BASIC HARVARD ARCHITECTURE
[Figure: PROGRAM MEMORY and DATA MEMORY]

MODIFICATION #1
[Figure: PROGRAM/DATA MEMORY and DATA MEMORY]

MODIFICATION #2
[Figure: PROGRAM MEMORY and MULTI-PORT DATA MEMORY]

MEMORY ORGANISATION - II

MODIFICATION #3
[Figure: PROGRAM CACHE, PROGRAM/DATA MEMORY and DATA MEMORY]

MODIFICATION #4
[Figure: PROGRAM MEMORY and two DATA MEMORY banks]

MODIFICATION #5
[Figure: box labels include I/O, PROGRAM MEMORY and several DATA MEMORY banks]
Addressing modes
Parallel memory instructions must specify up to 3 memory accesses
• Number of bits required is very large
• More memory, wider busses, more memory cycles

Solution: register-indirect addressing modes
• Many addresses in a one-word instruction
• Only a few bits are required since the register bank is small; parallel hardware updates the registers containing memory addresses.
Addressing Modes (contd.)
• Short immediate addressing mode
• Short direct addressing mode
• Memory mapped addressing mode
• Circular buffer addressing mode
• Bit-reversed addressing mode
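Bit-reversed addressing is commonly used to reorder the data of a radix-2 FFT, where element i is exchanged with the element whose index has the same bits in reverse order. The slides only name the mode, so the following is a generic sketch with a hypothetical function name, not a description of any particular processor's address generator.

```python
def bit_reverse(index, bits):
    """Return index with its lowest `bits` bits in reverse order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the lowest bit of index onto result
        index >>= 1
    return result

# Access order for an 8-point transform (3 address bits).
order = [bit_reverse(i, 3) for i in range(8)]
```

For 3 address bits the resulting order is 0, 4, 2, 6, 1, 5, 3, 7; hardware support produces this sequence without spending instruction cycles on the index arithmetic.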
Instruction Level Parallelism

VLIW architecture
• Each instruction specifies several operations to be done in parallel
• Advantages : Simple hardware; compilers can spot ILP easily
• Disadvantages : Little compatibility between generations; explicit NOPs bloat code
Super scalar architecture

• Hardware responsible for finding ILP in a sequential program
• Advantage : Compatibility between generations
• Disadvantage : Very complex hardware
Explicitly Parallel Instruction
Computing (EPIC)
• Combines VLIW and super scalar
architectures
• Instructions are grouped into 3
operating blocks and a template block
• Template block tells hardware if
instructions can be executed in parallel
• Also gives information whether the
block can be executed in parallel
ILP versus Power

Increasing instructions / cycle
• Requires fewer cycles to execute a task
• Allows a longer clock period for the same performance
• Allows a lower supply voltage
• And hence uses less power
However, too many functional units and too many transitions per clock cycle increase power consumption.
Low Power architecture

• Power consumed by additional circuits vs. ability to lower clock rate while maintaining performance
• Circuits must be highly used
• Move complexity into software
• Voltage scaling : Reduce Vdd
• Clock gating : Turn off clock when chip is not in use (applies to sub-modules of chip also)
• VLIW is more suitable than super scalar for low power
  - VLIW is smaller for same number of functional units
  - Compiler is better at finding parallelism than hardware
• Put multiple processors on chip rather than lots of functional units in one processor
  - Helps in running independent tasks
Improvement of Speed by
Pipelining
• Processor speed can be enhanced by having separate
hardware units for the different functional blocks,
with buffers between the successive units.
– The number of unit operations into which the instruction
cycle of a processor can be divided for this purpose
defines the number of stages in the pipeline.
– A processor having an n-stage pipeline would have up to
n instructions simultaneously being processed by the
different functional units of the processor.
• Effective processor speed increases ideally by a
factor equal to the number of pipelining stages.
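The "ideal" factor can be made concrete with a cycle count. Assuming every stage takes one cycle, a k-stage pipeline finishes n instructions in k + (n - 1) cycles, against n·k cycles without overlap. The function names below are hypothetical and the model ignores everything except an explicit stall count.

```python
def pipeline_cycles(n_instructions, n_stages, stall_cycles=0):
    """Cycles to complete n instructions: fill the pipe once, then one result per cycle."""
    return n_stages + (n_instructions - 1) + stall_cycles

def speedup(n_instructions, n_stages, stall_cycles=0):
    """Speed gain over executing each instruction's stages strictly one after another."""
    unpipelined = n_instructions * n_stages
    return unpipelined / pipeline_cycles(n_instructions, n_stages, stall_cycles)
```

For a long instruction stream the speedup approaches the number of stages (for example, about 3.99 for 1000 instructions on 4 stages), while short sequences or stalls keep it below that limit.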
A Four-stage Pipeline
Data Dependency in Pipelining
If the input data for an instruction depends on the
outcome of the previous instruction, the Write cycle of
the previous instruction has to be over before the
Operate cycle of the next instruction can start. The
pipeline effectively idles through one instruction,
creating a bubble in the pipeline which persists for
several instructions.
F1   D1   O1   W1
     F2   D2   idle O2   W2
          F3   idle D3   O3   W3
                    F4   D4   O4   W4   (bubble ends here)
Example of dependency
• A ← 3 + A; B ← 4 × A
  Can’t perform these two in parallel
• Another case: A = B + A; B = A – B; A = A – B (swapping without temp); examine how you can handle this.
Branch Dependency in Pipelining
A Branch instruction can cause a pipeline stall if the branch
is taken, as the next instruction has to be aborted in that
case. If I1 is an unconditional branch instruction, the next
Fetch cycle (F2) can start after D1. But if I1 is a conditional
branch instruction, F2 has to wait until O1 for the decision as
to whether the branch will be taken or not.

F1   D1   O1   W1                   branch instruction
     F2   D2   O2   W2              executed if branch is not taken
          F2   D2   O2   W2         executed for unconditional branch
               F2   D2   O2   W2    for conditional branch, if taken
Pipeline in ADSP 219x
Processors
6-Stage Instruction Pipeline with Single-cycle
Computational Units:
• Look-Ahead: places the address of an instruction that
is going to be executed several cycles down the road,
on the program-memory address (PMA) bus
• Pre-fetch: Pre-fetches an instruction if the instruction
is already in the instruction cache
• Fetch: Fetches the instruction that was “looked ahead”
2 cycles ago
• Address-decode: Decoding of the DAG operand fields
in the opcode in this cycle
• Decode: The second stage of the instruction decoding
process, where the rest of the opcode is decoded
• Execute: Instruction is executed, status updated, results
written to destinations
Causes for Pipeline Stalls
• Memory block conflicts: If both instruction and data are to be fetched from the same block of memory, a stall is automatically inserted
• DAG usage immediately (or within 2 cycles) after initialization, e.g.
    I2 = 0x1234;
    AX0 = DM(I2,M2);
• Bus conflicts: Instructions which use the PMA/PMD buses for data transfer may cause a bus conflict, e.g.
    PM(I5,M7)=M3;
Avoiding DAG-related Pipeline
Stalls
• Note that
– I2 = 0x1234;
– I3 = 0x0001;
– I1 = 0x002;
– AX0 = DM(I2,M2);
will NOT cause a stall.
• Also, note that switching DAG register banks (primary ↔ secondary) immediately before using them will NOT cause a stall.
TMS320C25 KEY FEATURES

• 100 ns INSTRUCTION CYCLE TIME
• 128K-WORDS TOTAL MEMORY SPACE
• THREE PARALLEL SHIFTERS
• 4K x 16 PROGRAM ROM
• 133 GENERAL PURPOSE AND DSP INSTRUCTIONS
• S/W UPWARD COMPATIBLE WITH PREVIOUS FAMILY MEMBERS
• 1.8 µm CMOS: 68-PIN PLCC / PGA

[Block diagram labels: +5 V, GND, interrupts, 288 x 16 data RAM, 256 x 16 data/program RAM, multiprocessor interface, serial interface, 32-bit ALU/ACC, 16 x 16 multiplier, data (16), address (16)]
TMS320C25 GENERAL PURPOSE FEATURES

• COMPREHENSIVE INSTRUCTION SET - 133 INSTRUCTIONS INCLUDING
  - NUMERICAL (34)
  - LOGICAL (15)
  - MEMORY MANAGEMENT (33)
  - BRANCHES (20)
  - PROGRAM/MODE CONTROL (31)
• EXTENDED-PRECISION ARITHMETIC
• SERIAL PORT (DOUBLE BUFFERED, BIT STATIC)
• MULTIPROCESSOR INTERFACES (CONCURRENT DMA, GLOBAL DATA MEMORY)
• BLOCK MOVES (UP TO 10 M WORDS/SEC)
• ON-CHIP TIMER
• THREE EXTERNAL MASKABLE INTERRUPTS

[Flowchart labels: X=X-Y; 16=0 (NO / YES); OUTPUT X]
TMS320C25 ALU

DESIGN & OPERATION

• 32-BIT ALU & ACCUMULATOR
• CARRY BIT FOR EXTENDED PRECISION
• OVERFLOW DETECTION & SATURATION
• SIGN EXTENSION OPTION
• 0-16 BIT PARALLEL SHIFTER FOR LOADS AND ARITHMETIC OPS
• SHIFTERS ON PRODUCT REGISTER OUTPUT DATA
• 0-7 BIT PARALLEL SHIFTER FOR ACCUMULATOR STORES
TMS320C25 - MULTIPLY INSTRUCTIONS II

MAC    MPY data memory * program memory & add past P-Reg to ACC
MACD   MPY data memory * program memory, add past P-Reg to ACC, & move data memory
SQRA   Square data memory value & add past P-Reg to ACC
SQRS   Square data memory value


TMS320C25 SPECIAL PURPOSE FEATURES

• SINGLE-CYCLE MULTIPLY/ACCUMULATE
• MULTIPLY/ACCUMULATE USING EXTERNAL PROGRAM MEMORY
• REPEAT INSTRUCTION
• ADAPTIVE FILTERING INSTRUCTIONS
• BIT-REVERSED ADDRESSING
• AUTOMATIC DATA-MOVE IN MEMORY (Z^-1)
• 0-16 BIT SCALING SHIFTER (SIGNED OR UNSIGNED)
• OVERFLOW MANAGEMENT
  - SATURATION MODE
  - BRANCH ON OVERFLOW
  - PRODUCT RIGHT SHIFT

[Figure: tapped delay line from input Xi through Z^-1 delays to output Yi]
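TMS320C25 - HIGHER PERFORMANCE AT LESS CODE SPACE

[Figure: FIR tapped delay line. Input xn passes through Z^-1 delays; each tap is multiplied (x) and the products are summed to give Yn]

$Y_n = \sum_{K=0}^{N} b_K \, X(n-K)$

TMS320C25:
RPTK 49
MACD

3 WORDS PROG MEMORY
53 CYCLES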
TMS320C25 ADDRESSING MODES
• IMMEDIATE ADDRESSING - BOTH LONG AND SHORT CONSTANTS - EXAMPLES: ADDK 5, ADLK >1325
  [Figure: constants 5 and 1325 held in program memory alongside the instruction]
• DIRECT ADDRESSING - SAME AS TMS320C1X BUT DP IS 9 BITS - 512 “BANKS” OF 128 WORDS - USED OFTEN FOR LONG SEQUENCES OF IN-LINE CODE
  [Figure: address formed from DP (9 BITS) and 7 BITS from the instruction]
• INDIRECT ADDRESSING - 8 AUXILIARY REGISTERS - USED OFTEN IN PROGRAM LOOPS WITH AUTO INC/DEC OPTIONS
  [Figure: auxiliary register supplies the OPERAND ADDRESS]
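The direct addressing scheme can be illustrated by concatenating the two fields: the 9-bit data page pointer (DP) selects one of 512 pages and the 7 bits from the instruction select one of 128 words within it. The function name below is hypothetical; only the field widths come from the slide.

```python
def direct_address(dp, offset):
    """Form a 16-bit data address from a 9-bit page pointer and a 7-bit offset."""
    if not 0 <= dp < 512:
        raise ValueError("DP must fit in 9 bits")
    if not 0 <= offset < 128:
        raise ValueError("offset must fit in 7 bits")
    return (dp << 7) | offset     # DP supplies the upper bits, the instruction the lower 7
```

With this layout, page 1 starts at address 128, and the highest page and offset together reach 65535, the top of a 16-bit address space.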
Addressing Mode (contd.)

• Circular buffer addressing mode

• Bit-reversed addressing mode


TMS320C25 AUXILIARY REGISTER INSTRUCTIONS

LAR    Load aux-reg w/data
LARK   Load AR w/8-bit constant
LRLK   Load AR w/16-bit constant
MAR    Modify auxiliary register
SAR    Store auxiliary register
ADRK   Add 8-bit constant to AR
SBRK   Sub 8-bit constant from AR
LARP   Load auxiliary register pointer
TMS320C25 ON-CHIP MEMORY

MEMORY ORGANIZATION
• 4K WORDS ON-CHIP MASKED ROM
• 544 WORDS ON-CHIP DATA RAM
• 256 WORDS ON-CHIP RAM RECONFIGURABLE AS DATA/PROGRAM MEMORY
• BLOCK TRANSFERS IN MEMORY
• DIRECT, INDIRECT, AND IMMEDIATE ADDRESSING MODES
BLOCK DIAGRAM OF A TMS320C5X DSP
General-Purpose Microprocessor circa 1984 : Intel 8088

• ~100,000 transistors
• Clock speed : ~ 5 MHz
• Address space : 20 bits
• Bus width : 8 bits
• 100+ instructions
• 2-35 cycles per instruction
• Micro-coded architecture

DSP TMS 32010 1984
• Clock 20 MHz
• 16 bits
• 8, 12 bits addressing space
• ~ 50 k transistors
• ~ 35 instructions
• Harvard architecture
• Hardware multiplier
• Double length accumulator with saturation
• A few special DSP instructions

General Purpose Microprocessor 2000
• GHz clock speed
• 32-bit address or more
• 32-bit bus, 128-bit instructions
• Complex MMU
• Super scalar CPU
• MMX instructions
• On chip cache
• Single cycle execution
• 32-bit floating point ALU on board
• Very expensive
• 10s of watts of power

DSP in 2000
• Clock 100 ~ 200 MHz
• 16-bit fixed point or 32-bit floating point
• 16-24 bits address space
• Large on-chip and off-chip memories
• Single cycle execution of most instructions
• Harvard architecture
• Lots of special DSP instructions
• 50 mW to 2 W power
Future of DSP Microprocessor
• Sufficiently unique for an independent class of applications (HDD, cell phone)
• Low power consumption, low cost
• High performance within power, cost constraints (MIPS/mW, MIPS/$)
• Fixed point & floating point
• Better compilers - but users must be informed
• Hybrid DSP/GP systems
DSP Architectures
Professor S. Srinivasan
Electrical Engineering Department
I.I.T.-Madras, Chennai –600 036
