DSP Architecture


Class Presentation of Custom DSP Implementation Course on:

TMS320C54x DSP processor

ECE Department, University of Tehran

Presented by: Shahab adin Rahmanian
May 2005

This is a class presentation. All data are copyrights of their respective authors as listed in the references and have been used here for educational purposes only.

Outline
Introduction
Architecture
Applications
Features
Instruction Set and addressing
FIR Filtering
Accelerating Polynomial Evaluation
Numerical Issues
Write code in C
Conclusion

Introduction

TMS320C54x [2]
  A fixed-point digital signal processor (DSP) in the TMS320 family.
  Low-power DSP: 0.54 mW/MIPS
  Acceleration for FIR and LMS filtering, code book search, polynomial evaluation, Viterbi decoding, fast Fourier transform [4]

Some Typical Applications


General-Purpose
  Adaptive filtering
  Digital filtering
  Fast Fourier transforms
Control
  Disk drive control
  Laser printer control
  Robotics control
Military
  Missile guidance
  Radar processing
  Secure communication
Telecommunications
  1200- to 19200-bps modems
  Adaptive equalizers
  Cellular telephones
  Echo cancellation
  Video conferencing

Software Applications
Circular Buffers
Single-Instruction Repeat (RPT) Loops
Extended-Precision Arithmetic
  Addition and Subtraction
  Multiplication
  Division
  Square Root
Floating-Point Arithmetic
Application-Oriented Operations
  Symmetric FIR Filters
  Adaptive Filtering
  Viterbi Algorithm for Channel Decoding
Fast Fourier Transforms

Some key features


CPU
  Advanced multibus architecture with three separate 16-bit data buses and one program bus
  40-bit arithmetic logic unit (ALU), including a 40-bit barrel shifter and two independent 40-bit accumulators
  17-bit × 17-bit parallel multiplier coupled to a 40-bit dedicated adder for non-pipelined single-cycle multiply/accumulate (MAC) operation
Memory
  192K words × 16-bit maximum addressable memory space (64K words program, 64K words data, and 64K words I/O)
  28K words × 16-bit single-access on-chip ROM with 8K words configurable as program or data memory (C541 only)

Some key features


On-chip peripherals
  On-chip phase-locked loop (PLL) clock generator with internal oscillator or external clock source
  Two full-duplex serial ports to support 8- and 16-bit transfers (C541 only)
  Time-division multiplexed (TDM) serial port (C542/C543 only)
  One 16-bit timer
Speed: 25/20-ns execution time for a single-cycle fixed-point instruction (40 MIPS/50 MIPS) with 5-V power supply

C54x Addressing Modes


Immediate: operand is part of the instruction
  ADD #0FFh
Absolute: address of operand is part of the instruction
  LD *(LABEL), A
Register: operand is specified in a register
  READA DATA ; data read from address in accumulator A

C54x Addressing Modes


Direct: address of operand is part of the instruction (added to implied memory page)
  ADD 010h,A
Indirect: address of operand is stored in a register
  ADD *AR1
  Offset addressing: ADD *AR1(10)
  Register offset (ar1+ar0): ADD *AR1+0
  Autoincrement/decrement: ADD *AR1+
  Bit-reversed addressing: ADD *AR1+0B
  Circular addressing: ADD *AR1+%
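
The circular and bit-reversed modes are perhaps easiest to picture by emulating the index update in C. The sketch below is illustrative only: the function names `circ_next` and `bitrev_index` are hypothetical, and the code models array indices rather than the C54x address-generation hardware (details such as buffer alignment are not modeled).

```c
#include <stddef.h>

/* Circular post-increment: advance idx by step inside a buffer of
   `size` elements, wrapping at the end (loosely models *ARx+% with
   the buffer size held in BK). */
size_t circ_next(size_t idx, size_t step, size_t size)
{
    return (idx + step) % size;
}

/* Bit-reversed index: reverse the low `bits` bits of i. This is the
   access order that bit-reversed addressing produces for radix-2
   FFT data. */
unsigned bitrev_index(unsigned i, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1u);   /* move the lowest bit of i into r */
        i >>= 1;
    }
    return r;
}
```

For an 8-element buffer, `circ_next(7, 1, 8)` wraps back to index 0, and for an 8-point FFT `bitrev_index(1, 3)` maps index 1 to index 4.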

C54x Instruction Set by Category

Arithmetic: ADD, MAC, MAS, MPY, NEG, SUB, ZERO
Logical: AND, BIT, BITF, CMPL, CMPM, OR, ROL, ROR, SFTA, SFTC, SFTL, XOR
Program Control: B, BC, CALL, CC, IDLE, INTR, NOP, RC, RET, RPT, RPTB, RPTZ, TRAP, XC
Application Specific: ABS, ABDST, DELAY, EXP, FIRS, LMS, MAX, MIN, NORM, POLY, RND, SAT, SQDST, SQUR, SQURA, SQURS
Data Management: LD, MAR, MV(D,K,M,P), ST

Notes
  CMPL: complement
  CMPM: compare memory
  MAR: modify address reg.
  MAS: multiply and subtract

Block FIR Filtering


y[n] = h0 x[n] + h1 x[n-1] + ... + hN-1 x[n-(N-1)]

; Addresses: a4 h, a5 N samples of x, a6 input buffer, a7 output buffer
; h stored as linear array of N elements (in prog. mem.)
; x stored as circular array of N elements (in data mem.)
; Modulo addressing prevents need to reinitialize regs each sample
; Moving filter coefficients from program to data memory is not shown
firtask: ld    #firDP,dp            ; initialize data page pointer
         stm   #frameSize-1,brc     ; compute 256 outputs
         rptbd firloop-1
         stm   #N,bk                ; FIR circular buffer size
         ld    *ar6+,a              ; load input value to accumulator a
         stl   a,*ar4+%             ; replace oldest sample with newest
         rptz  a,#(N-1)             ; zero accumulator a, do N taps
         mac   *ar4+0%,*ar5+0%,a    ; one tap, accumulate in a
         sth   a,*ar7+              ; store result to output buffer
firloop: ret
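
As a rough C counterpart to the listing above, the following sketch computes one FIR output per call using a circular delay line, so no samples are shifted in memory. It is an illustration under stated assumptions, not the presentation's code: the names `fir_state`, `fir_step`, and `FIR_TAPS` are hypothetical, plain integers are used instead of Q15 scaling, and a 32-bit C accumulator stands in for the 40-bit hardware accumulator.

```c
#include <stdint.h>
#include <stddef.h>

#define FIR_TAPS 4              /* hypothetical filter length N */

typedef struct {
    int16_t x[FIR_TAPS];        /* circular delay line */
    size_t  newest;             /* index of the most recent sample */
} fir_state;

/* Compute y[n] = sum over k of h[k] * x[n-k] for one new input. */
int32_t fir_step(fir_state *s, const int16_t h[FIR_TAPS], int16_t in)
{
    /* Advance the write position and overwrite the oldest sample. */
    s->newest = (s->newest + 1) % FIR_TAPS;
    s->x[s->newest] = in;

    int32_t acc = 0;
    size_t idx = s->newest;
    for (size_t k = 0; k < FIR_TAPS; k++) {
        acc += (int32_t)h[k] * s->x[idx];        /* one MAC per tap */
        idx = (idx + FIR_TAPS - 1) % FIR_TAPS;   /* step back to x[n-k-1] */
    }
    return acc;
}
```

Feeding a unit impulse (1, 0, 0, 0) into a zero-initialized state returns the coefficients h[0], h[1], h[2], h[3] in order, which is a quick sanity check of the index arithmetic.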

Accelerating Symmetric FIR Filtering


Coefficients in linear phase filters are either symmetric or anti-symmetric
Symmetric coefficients: factoring out the shared coefficients leaves 2 mults and 3 adds for this 4-tap example
y[n] = h0 x[n] + h1 x[n-1] + h1 x[n-2] + h0 x[n-3]
y[n] = h0 (x[n] + x[n-3]) + h1 (x[n-1] + x[n-2])
Accelerated by FIRS (FIR Symmetric) instruction
  x in two circular buffers
  h in program memory
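
The factoring of the symmetric 4-tap filter can be checked numerically. The sketch below compares the direct form (4 multiplies) with the folded form (2 multiplies). The function names are hypothetical and plain integer arithmetic is assumed; it shows the algebra only, not the FIRS instruction itself.

```c
#include <stdint.h>

/* Direct form: x[0] is x[n], x[3] is x[n-3]; coefficients are
   h0, h1, h1, h0, so only h[0] and h[1] are stored. */
int32_t fir4_direct(const int16_t h[2], const int16_t x[4])
{
    return (int32_t)h[0] * x[0] + (int32_t)h[1] * x[1]
         + (int32_t)h[1] * x[2] + (int32_t)h[0] * x[3];
}

/* Folded form: add the two samples that share a coefficient first,
   then multiply once per coefficient. */
int32_t fir4_folded(const int16_t h[2], const int16_t x[4])
{
    int32_t s0 = (int32_t)x[0] + x[3];   /* x[n]   + x[n-3] */
    int32_t s1 = (int32_t)x[1] + x[2];   /* x[n-1] + x[n-2] */
    return h[0] * s0 + h[1] * s1;
}
```

With h = {3, -5} and x = {10, 20, 30, 40}, both functions return -100.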

Accelerating Symmetric FIR Filtering


; Addresses: a6 input buffer, a7 output buffer
; a4 array with x[n-4], x[n-3], x[n-2], x[n-1] for N = 8
; a5 array with x[n-5], x[n-6], x[n-7], x[n-8] for N = 8
; Modulo addressing prevents need to reinitialize regs each sample
firtask: ld    #firDP,dp              ; initialize data page pointer
         stm   #frameSize-1,brc       ; compute 256 outputs
         rptbd firloop-1
         stm   #N/2,bk                ; FIR circular buffer size
         ld    *ar6+,b                ; load input value to accumulator b
         mvdd  *ar4,*ar5+0%           ; move old x[n-N/2] to new x[n-N/2-1]
         stl   b,*ar4%                ; replace oldest sample with newest
         add   *ar4+0%,*ar5+0%,a      ; a = x[n] + x[n-N/2-1]
         rptz  b,#(N/2-1)             ; zero accumulator b, do N/2-1 taps
         firs  *ar4+0%,*ar5+0%,coeffs ; b += a * h[i], do next a
         mar   *+ar4(2)%              ; to load the next newest sample
         mar   *ar5+%                 ; position for x[n-N/2] sample
         sth   b,*ar7+                ; store result to output buffer
firloop: ret

Architecture - FIRS

Accelerating Polynomial Evaluation


Function approximation and spline interpolation
Fast polynomial evaluation (N coefficients)
  y(x) = c0 + c1 x + c2 x^2 + c3 x^3          (expanded form)
  y(x) = c0 + x (c1 + x (c2 + x (c3)))        (Horner's form)
POLY reduces 2N cycles using MAC+ADD to N cycles
; ar2 contains address of array [c3 c2 c1 c0]
; poly uses temporary register t for multiplicand x
; first two times poly instruction executes gives
;   1. a = c(3) + x * 0 = c(3);   b = c2
;   2. a = c(2) + x * c(3);       b = c1
         ld    *ar2+,16,b    ; b = c3 << 16
         ld    *ar3,t        ; t = x (ar3 contains addr of x)
         rptz  a,#3          ; a = 0, repeat next inst. 4 times
         poly  *ar2+         ; a = b + x*a || b = c(i-1) << 16
         sth   a,*ar4        ; store result (ar4 is addr of y)
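
Horner's form maps onto a loop with one multiply and one add per coefficient. The sketch below uses `double` for readability, which is a simplification: the C54x POLY instruction works on fixed-point values. The function name `horner` and the coefficient ordering (c[0] is the constant term) are assumptions of this example.

```c
#include <stddef.h>

/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) using Horner's form,
   starting from the highest-order coefficient. */
double horner(const double *c, size_t n, double x)
{
    double acc = 0.0;
    for (size_t i = n; i-- > 0; ) {
        acc = acc * x + c[i];   /* one multiply and one add per coefficient */
    }
    return acc;
}
```

For c = {1, 2, 3, 4} and x = 2, the loop produces 4, 11, 24, and finally 49, which matches 1 + 2*2 + 3*4 + 4*8.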

Integer Multiplication
Integer multiplication yields products larger than the inputs. For example, using single-digit decimal values as inputs: 9 x 9 = 81
Does the user store the lower (1) or upper (8) result?
  Both must be kept, resulting in additional resources (two cycles, words of code, and RAM locations) to complete the store.
  Worse, how can the double-sized result be used recursively as an input in later calculations, given that the multiplier inputs are single-width?
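
The same problem appears with 16-bit binary operands: the full product needs 32 bits, so two 16-bit words must be stored. The sketch below, with the hypothetical names `product_words` and `mul16_full`, uses unsigned operands to keep the C well defined.

```c
#include <stdint.h>

/* The two 16-bit halves of a 32-bit product. */
typedef struct {
    uint16_t hi;   /* upper word */
    uint16_t lo;   /* lower word */
} product_words;

/* Full 16 x 16 -> 32-bit unsigned product, split into two words. */
product_words mul16_full(uint16_t a, uint16_t b)
{
    uint32_t p = (uint32_t)a * b;
    product_words w;
    w.hi = (uint16_t)(p >> 16);
    w.lo = (uint16_t)(p & 0xFFFFu);
    return w;
}
```

For 300 x 300 = 90000 (0x15F90), the upper word is 1 and the lower word is 0x5F90. Neither word alone represents the product.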

Fractional Multiplication
Multiplication of fractions yields products that never exceed the range of a fraction. For example, using single-digit decimal fractions as inputs: .9 x .9 = .81
Don't we still have a double-sized result to store?
  In this case, we can store just the upper result (.8)
  This allows storage of the result with fewer resources
  Results may be used recursively
Has accuracy been lost by dropping the lower accumulator value?

Accuracy vs. Precision


Often the programmer wants to retain the fullest accuracy of a calculation, thus dropping the 16 LSBs of the result in the previous example seems a bad choice.
Note though, the inputs: how much accuracy do they offer?
The product offers double precision but its accuracy is based on the single-width inputs.
Thus, storing a single-precision result is not only an efficient solution, but represents the limit of the accuracy of the result.
The accumulator is double-sized for two reasons:
  To allow for integer operations, which would possibly require the LSBs for the result.
  So that sum-of-product operations will generate accumulative noise at the 32nd vs. the 16th bit.

Redundant Sign Bit


Multiplication of two signed numbers yields a product with two sign bits
Extra sign bit causes problems if stored to memory as result:
  Wastes space
  Creates off-size Q
Solution: Fractional mode bit!
  When FRCT (mode bit in ST1) is set, the multiplier output is left-shifted by one
For 16-bit C54x: Q15 * Q15 = Q15
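
The effect of the one-bit left shift can be imitated in C. In the sketch below (the function name `q15_mul` is hypothetical), the Q30 product is shifted right by 15, which is arithmetically the same as shifting left by one and keeping the upper 16 bits. It assumes an arithmetic right shift for negative values, which is implementation-defined in C but is what common compilers provide.

```c
#include <stdint.h>

/* Multiply two Q15 fractions and return a Q15 result. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t q30 = (int32_t)a * b;   /* Q15 x Q15 = Q30: two sign bits */
    int32_t q15 = q30 >> 15;        /* same as (q30 << 1) >> 16 */
    if (q15 > INT16_MAX) {
        q15 = INT16_MAX;            /* only reached for (-1) x (-1) */
    }
    return (int16_t)q15;
}
```

Here 0.5 x 0.5 (16384 x 16384) gives 8192, which is 0.25 in Q15. The single case that does not fit, -1 x -1, is clamped to the largest positive fraction in this sketch.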

Accumulation
With fractions, we were able to guarantee that no multiplicative overflow could occur, i.e. F*F <= F.
For addition, this rule does not apply, i.e. F+F > F.
Therefore, we need additional measures to manage the possibility of overflow for accumulation. Two general methods apply:
  Guard Bits: the C54x offers an 8-bit extension above the high accumulator to allow valid representation of the result of up to 256 summations.
  Non-gain Systems: offer additional criteria that allow a simple solution for unlimited-length summations.

Guard Bits and saturation


Guard Bits: the C54x offers an 8-bit extension above the high accumulator to allow valid representation of the result of up to 256 summations.
Saturation (SAT)
  SAT instruction saturates a value exceeding the 32-bit range in the selected accumulator:
  SAT A
  SAT B
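
A small C model may help show why 8 guard bits cover 256 summations and what saturation does afterwards. This is a sketch with hypothetical names (`accumulate`, `sat32`); a 64-bit integer stands in for the 40-bit accumulator.

```c
#include <stdint.h>

/* Sum n copies of a 32-bit value in a wide accumulator. */
int64_t accumulate(int32_t v, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += v;
    }
    return acc;
}

/* Saturate a wide accumulator value to the signed 32-bit range. */
int32_t sat32(int64_t acc)
{
    if (acc > INT32_MAX) return INT32_MAX;
    if (acc < INT32_MIN) return INT32_MIN;
    return (int32_t)acc;
}
```

Adding the largest 32-bit value 256 times gives 2^39 - 256, which still fits in a signed 40-bit range (at most 2^39 - 1). Saturating that sum to 32 bits yields the largest positive 32-bit value.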

Non-gain Systems

Many systems can be modeled to have no DC gain:
  Filters with low Q.
  Any system scaled by its maximum gain value.
Input values from A/D converters are automatically fractions, if the limits of the A/D are presumed to be +/-1
Coefficient values can similarly be bounded by making the largest value the scaling factor for all other values.
For these systems, it is known that the final value of the process is less than or equal to the input values.
The accumulator therefore can be allowed to temporarily overflow, since the final result is known to be bounded by +/-1.
  Allows maximum usage of selected A/D and D/A converters
  D/A bits for gain are more expensive than using analog components
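
Temporary overflow is harmless here because two's-complement addition is modular: if the true final sum fits in the word, wraparound in intermediate steps cancels out. The sketch below demonstrates this with a 16-bit accumulator. The name `wrap_sum16` is hypothetical, and unsigned arithmetic is used so that the wraparound is well defined in C.

```c
#include <stdint.h>

/* Sum 16-bit values in a 16-bit wraparound accumulator. */
int16_t wrap_sum16(const int16_t *v, int n)
{
    uint16_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc = (uint16_t)(acc + (uint16_t)v[i]);   /* wraps modulo 2^16 */
    }
    /* Reinterpret the 16-bit pattern as a two's-complement value. */
    if (acc >= 0x8000u) {
        return (int16_t)((int32_t)acc - 65536);
    }
    return (int16_t)acc;
}
```

Summing 30000, 20000, -25000, -20000 passes through 50000, which does not fit in a signed 16-bit word, yet the function still returns the correct total of 5000.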

Division
The C54x does not have a single-cycle 16-bit divide instruction
  Divide is a rare function in DSP
  Division hardware is expensive
The C54x does have a single-cycle 1-bit divide instruction: conditional subtract, or SUBC
  Preceded by RPT #15, a 16-bit divide is performed
  Is much faster than without SUBC
The SUBC process operates only on unsigned operands, thus software must:
  Compare the signs of the input operands
    If they are alike, plan a positive quotient
    If they differ, plan to negate (NEG) the quotient
  Strip the signs of the inputs
  Perform the unsigned division
  Attach the proper sign based on the comparison of the inputs

Division Routine
B = num*den (tells sign)
Strip sign of numerator
Strip sign of denominator
16 iterations of 1-bit divide
If result needs to be negative:
  Invert sign
  Store negative result
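
The routine above can be modeled in C as a conditional-subtract step repeated 16 times, wrapped by the sign handling. This is a sketch under assumptions: the names `subc_step`, `udiv16`, and `sdiv16` are hypothetical, the step is a simplified model of SUBC rather than a bit-exact emulation, the sign is found by comparing operand signs instead of multiplying them, and a zero denominator and the quotient +32768 are not handled.

```c
#include <stdint.h>
#include <stdlib.h>

/* One conditional-subtract step: try to subtract the denominator
   aligned at bit 15; on success shift in a 1, otherwise shift in a 0. */
static uint32_t subc_step(uint32_t acc, uint16_t den)
{
    uint32_t trial = (uint32_t)den << 15;
    if (acc >= trial) {
        return ((acc - trial) << 1) + 1;
    }
    return acc << 1;
}

/* Unsigned 16-bit divide: 16 steps leave the quotient in the low
   word of acc and the remainder in the high word. den must be > 0. */
uint16_t udiv16(uint16_t num, uint16_t den)
{
    uint32_t acc = num;
    for (int i = 0; i < 16; i++) {
        acc = subc_step(acc, den);
    }
    return (uint16_t)(acc & 0xFFFFu);
}

/* Signed divide: note the sign, strip it, divide, then reattach it. */
int16_t sdiv16(int16_t num, int16_t den)
{
    int negative = (num < 0) != (den < 0);
    uint16_t q = udiv16((uint16_t)abs(num), (uint16_t)abs(den));
    return negative ? (int16_t)(-(int32_t)q) : (int16_t)q;
}
```

For 7 / 2 the accumulator ends as 0x00010003: quotient 3 in the low word and remainder 1 in the high word.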

Rounding
The result of multiplication can be rounded for MPY, MAC, and MAS operations. This is specified by appending an R suffix to the instruction.
  Example: MAC with rounding is MACR. Rounding consists of adding 2^15 to the result and then clearing the low accumulator.
  In a long sum-of-products, only the last MAC operation should specify rounding.
Rounding can also be achieved with a load operation.
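
The rounding rule (add 2^15, then clear the low 16 bits) can be written out in C as follows. The function name `round_acc` is hypothetical, a 64-bit intermediate avoids overflow in the sketch, and two's-complement representation is assumed for the bit mask.

```c
#include <stdint.h>

/* Round an accumulator value to a multiple of 2^16:
   add 2^15, then clear the low 16 bits. */
int64_t round_acc(int32_t acc)
{
    int64_t r = (int64_t)acc + 0x8000;   /* add 2^15 */
    r &= ~(int64_t)0xFFFF;               /* clear the low accumulator */
    return r;
}
```

A value of 0x00018000 (1.5 in units of the high word) rounds up to 0x00020000, while 0x00017FFF rounds down to 0x00010000.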

Sign Extension (SXM)

Write code in C
Inline Assembly
  Allows direct access to assembly language from C
  Useful for operating on components not used by C
  Note: first column after leading quote is label field
Long operations should be written in ASM and called from C
  main C file retains portability
  yields more easily maintained structures
  eliminates risk of interfering with registers in use by C

Accessing MMRs from C


Using pointers to access Memory-Mapped Registers:

Create a pointer and set its value to the assigned memory address:
  volatile unsigned int *SPC_REG = (volatile unsigned int *) 0x0

Read and write to the register as any other pointer:
  *SPC_REG = 0xC8;

Accessing I/O Ports from C


1. create the port:
  ioport unsigned port8000;
2. access the port:
  x = port8000;
  port8000 = y;

Summary and Conclusion


C54x is a conventional digital signal processor
  Separate data/program busses (3 reads & 1 write/cycle)
  Extended precision accumulators
  Single-cycle multiply-accumulate
  Saturation and wraparound arithmetic
  Bit-reversed and circular addressing modes
C54x has instructions to accelerate algorithms
  Communications: FIR & LMS filtering, Viterbi decoding
  Speech coding: vector distances for code book search
  Interpolation: polynomial evaluation

References
[1] Texas Instruments, TMS320C54x DSP Design Workshop, May 1997
[2] TMS320C54x User's Guide
[3] www.ti.com
[4] Signal and Image Processing on the TMS320C54x DSP, by Prof. Brian L. Evans
[5] TMS320C54x Assembly Language Tools
