DSP Processor


What is a Processor?

Languages

Machine Level Language

In the form of 0s and 1s, hence recognized by the machine (processor).
Difficult to write, edit, and debug programs.

Medium Level Language (Assembly Language)

In the form of mnemonics. Mnemonics are English-like words that
denote the operation to be performed.
The mnemonics must be translated to machine language before
execution. An assembler is used for this translation: the assembler is
the program which converts the assembly language to machine
language.
A cross assembler is a program that allows a computer program
written on one type of computer to be translated for use on another type.

Language

High Level Language

In the form of statements.
Easy to write, edit, and debug programs.
Requires a compiler to convert the high level language to machine
level language.

Language

Instruction
Execution cycle
Machine Cycle

Subroutine
Interrupts

Conventional Processors and DSP Processors

Computers are extremely capable in two broad areas:
(1) data manipulation, such as word processing and database
management, and
(2) mathematical calculation, used in science, engineering, and Digital
Signal Processing.
All microprocessors can perform both tasks; however, it is difficult
(expensive) to make a device that is optimized for both. There are
technical tradeoffs in the hardware design, such as the size of the
instruction set and how interrupts are handled.
Even more important, there are marketing issues involved:
development and manufacturing cost, competitive position, product
lifetime, and so on. As a broad generalization, these factors have made
traditional microprocessors, such as the Pentium®, primarily directed at
data manipulation. Similarly, DSPs are designed to perform the
mathematical calculations needed in Digital Signal Processing.
(1) Data manipulation: this involves storing and sorting
information. Consider a word processing program. The basic task is
to store the information (typed in by the operator), organize the
information (cut and paste, spell checking, page layout, etc.), and
then retrieve the information (such as saving the document on a
floppy disk or printing it with a printer). These tasks are
accomplished by moving data from one location to another and
testing for inequalities (A = B, A < B, etc.).

(2) Mathematical calculation: the execution speed of most DSP
algorithms is limited almost completely by the number of
multiplications and additions required.
Consider the implementation of an FIR digital filter. The input signal is
referred to by x[ ], while the output signal is denoted by y[ ]. Our task
is to calculate the sample at location n in the output signal, i.e., y[n].
An FIR filter performs this calculation by multiplying appropriate
samples from the input signal by a group of coefficients, denoted by
a0, a1, a2, a3, …, and then adding the products. In equation form, y[n]
is found by:

y[n] = a0 x[n] + a1 x[n-1] + a2 x[n-2] + a3 x[n-3] + …
Here the input signal has been convolved with a filter kernel (i.e., an
impulse response) consisting of a0, a1, a2, a3, …. Depending on the
application, there may be only a few coefficients in the filter kernel, or
many thousands. While there is some data transfer and inequality
evaluation in this algorithm, such as to keep track of the intermediate
results and control the loops, the math operations dominate the
execution time.
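
As a minimal sketch (not from the slides), the FIR calculation above can be
written in C as a multiply-accumulate loop; the function name, array names,
and the tap count M are illustrative assumptions:

/* Minimal FIR sketch: y[n] = a0*x[n] + a1*x[n-1] + ... + a(M-1)*x[n-M+1].
   Assumes x[] holds at least M samples ending at index n, and a[] holds
   the M filter coefficients. Names and types are illustrative only. */
double fir_output(const double *x, int n, const double *a, int M)
{
    double y = 0.0;                  /* accumulator */
    for (int k = 0; k < M; k++) {
        y += a[k] * x[n - k];        /* one multiply-accumulate per coefficient */
    }
    return y;
}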

Offline processing

In offline processing, the entire input signal resides in the computer
at the same time.
For example, a geologist might use a seismometer to record the
ground movement during an earthquake. After the shaking is over, the
information may be read into a computer and analyzed in some way.
Another example of off-line processing is medical imaging, such as
computed tomography and MRI. The data set is acquired while the
patient is inside the machine, but the image reconstruction may be
delayed until a later time. The key point is that all of the information is
simultaneously available to the processing program. This is common in
scientific research and engineering, but not in consumer products.
Off-line processing is the realm of personal computers and mainframes.
Real time processing
In real-time processing, the output signal is produced at the same time
that the input signal is being acquired.
For example, this is needed in telephone communication, hearing aids,
and radar. These applications must have the information immediately
available, although it can be delayed by a short amount. For instance, a
10-millisecond delay in a telephone call cannot be detected by the
speaker or listener. Likewise, it makes no difference if a radar signal is
delayed by a few seconds before being displayed to the operator.
Real-time applications input a sample, perform the algorithm, and
output a sample, over-and-over. Alternatively, they may input a group
of samples, perform the algorithm, and output a group of samples. This
is the world of Digital Signal Processors.
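
As a rough sketch (not from the slides), a sample-by-sample real-time loop
looks like this in C; read_input_sample() and write_output_sample() are
hypothetical routines standing in for the platform's actual ADC/DAC drivers:

/* Hypothetical real-time loop: one sample in, one sample out, forever. */
extern double read_input_sample(void);        /* assumed: blocks until a sample arrives */
extern void   write_output_sample(double y);  /* assumed: sends a sample to the output  */
extern double run_algorithm(double x);        /* e.g., one step of an FIR filter        */

void real_time_loop(void)
{
    for (;;) {
        double x = read_input_sample();  /* input a sample        */
        double y = run_algorithm(x);     /* perform the algorithm */
        write_output_sample(y);          /* output a sample       */
    }
}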
Circular Buffer

To calculate the output sample, we must have access to a certain
number of the most recent samples from the input. For example,
suppose we use eight coefficients in this filter, a0, a1, …, a7. This means
we must know the value of the eight most recent samples from the
input signal, x[n], x[n-1], …, x[n-7]. These eight samples must be stored
in memory and continually updated as new samples are acquired.
Circular Buffer

[Figure: eight input samples stored in memory locations 20041 to 20048;
(a) the buffer at one instant in time, (b) after the next sample is acquired]
Circular Buffer
We have placed this circular buffer in eight consecutive memory
locations, 20041 to 20048. Figure (a) shows how the eight samples
from the input might be stored at one particular instant in time, while (b)
shows the changes after the next sample is acquired.
The idea of circular buffering is that the end of this linear array is
connected to its beginning; memory location 20041 is viewed as being
next to 20048, just as 20044 is next to 20045. You keep track of the
array by a pointer (a variable whose value is an address) that
indicates where the most recent sample resides. For instance, in (a)
the pointer contains the address 20044, while in (b) it contains 20045.
When a new sample is acquired, it replaces the oldest sample in the
array, and the pointer is moved one address ahead. Circular buffers
are efficient because only one value needs to be changed when a new
sample is acquired.
Circular Buffer
Four parameters are needed to manage a circular buffer (a minimal
sketch follows this list).
1. There must be a pointer that indicates the start of the circular buffer
in memory (in this example, 20041).
2. There must be a pointer indicating the end of the array (e.g.,
20048), or a variable that holds its length (e.g., 8).
3. The step size of the memory addressing must be specified. For
example, if the step size is one, address 20043 contains one sample,
address 20044 contains the next sample, and so on.
These three values define the size and configuration of the circular
buffer and will not change during program operation.
4. The pointer to the most recent sample must be modified as each
new sample is acquired. In other words, there must be program
logic that controls how this fourth value is updated based on the
values of the first three.
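
The C struct below is a minimal sketch of this bookkeeping; the field and
function names are illustrative assumptions, not any particular DSP's
register set:

/* Circular buffer bookkeeping: the four parameters described above. */
typedef struct {
    double *start;    /* 1. pointer to the start of the buffer        */
    int     length;   /* 2. number of samples in the buffer (e.g., 8) */
    int     step;     /* 3. address step size (usually 1)             */
    int     newest;   /* 4. index of the most recent sample           */
} circ_buf;

/* Store a new sample: overwrite the oldest entry and advance the pointer,
   wrapping around so the end of the array connects to its beginning. */
void circ_buf_put(circ_buf *cb, double sample)
{
    cb->newest = (cb->newest + cb->step) % cb->length;  /* wrap around           */
    cb->start[cb->newest] = sample;                     /* replace oldest sample */
}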
Von Neumann architecture

[Figure: a single memory connected to the CPU by a single bus]
Von Neumann architecture

A Von Neumann architecture consists of a single memory and a single
bus for transferring data into and out of the central processing unit
(CPU).

Multiplying two numbers requires at least three machine cycles, one to
transfer each of the three numbers over the bus from the memory to
the CPU. We don't count the time to transfer the result back to
memory, because we assume that it remains in the CPU for additional
manipulation.

The Von Neumann design is quite satisfactory when you are content to
execute all of the required tasks in serial. In fact, most computers today
are of the Von Neumann design.
Harvard architecture

It consists of two separate memories for data and program instructions,
with separate buses for each. Since the buses operate independently,
program instructions and data can be fetched at the same time,
improving the speed over the single bus design. DSPs use this dual
bus architecture.
The Super Harvard architecture

These are called SHARC® DSPs, an abbreviation of the longer term,
Super Harvard ARChitecture.
The Super Harvard architecture

The Harvard architecture is modified by including an instruction cache
and an I/O controller.

Instruction cache

DSP algorithms generally spend most of their execution time in loops.
This means that the same set of program instructions will continually
pass from program memory to the CPU. The Super Harvard
architecture takes advantage of this situation by including an instruction
cache in the CPU. This is a small memory that contains about 32 of the
most recent program instructions.
The Super Harvard architecture
The first time through a loop, the program instructions must be passed
over the program memory bus. This results in slower operation
because of the conflict with the coefficients that must also be fetched
along this path.

However, on additional executions of the loop, the program instructions
can be pulled from the instruction cache. This means that all of the
memory to CPU information transfers can be accomplished in a single
cycle: the sample from the input signal comes over the data memory
bus, the coefficient comes over the program memory bus, and the
program instruction comes from the instruction cache.

This efficient transfer of data is called a high memory-access
bandwidth.
The Super Harvard architecture

The I/O controller is connected to data memory.
This is how the signals enter and exit the system.
The SHARC DSPs provide both serial and parallel communications
ports. These are extremely high-speed connections.
Simplified diagram of SHARC DSP

I/O Controller

The figure shows the I/O controller connected to data
memory. This is how the signals enter and exit the
system. For instance, the SHARC DSPs provide both
serial and parallel communications ports. These are
extremely high-speed connections.
Dedicated hardware allows these data streams to be transferred
directly into memory (Direct Memory Access, or DMA), without having to
pass through the CPU's registers.
In other words, obtaining a sample from an I/O device and storing a
sample to an I/O device happen independently of, and simultaneously
with, the other tasks; no cycles are stolen from the CPU. The main buses
(program memory bus and data memory bus) are also accessible from
outside the chip, providing an additional interface to off-chip memory
and peripherals.
This allows the SHARC DSPs to use a four Gigaword (16 Gbyte)
memory.

Data Address Generator

It has two Data Address Generators (DAG), one for each of the two
memories. These control the addresses sent to the program and data
memories, specifying where the information is to be read from or written
to.
In simpler microprocessors this task is handled as an inherent part of the
program sequencer, and is quite transparent to the programmer.
However, DSPs are designed to operate with circular buffers, and benefit
from the extra hardware to manage them efficiently. This avoids needing
to use precious CPU clock cycles to keep track of how the data are
stored. For instance, in the SHARC DSPs, each of the two DAGs can
control eight circular buffers. This means that each DAG holds 32
variables (4 per buffer), plus the required logic.
Data Registers

The data register section of the CPU is used in the same way as in
traditional microprocessors. In the ADSP-2106x SHARC DSPs, there
are 16 general purpose registers of 40 bits each. These can hold
intermediate calculations, prepare data for the math processor, serve
as a buffer for data transfer, hold flags for program control, and so on.
If needed, these registers can also be used to control loops and
counters; however, the SHARC DSPs have extra hardware registers
to carry out many of these functions.

Math Processing
The math processing section has three parts: a multiplier, an arithmetic
logic unit (ALU), and a barrel shifter.
The multiplier takes the values from two registers, multiplies them, and
places the result into another register.
The ALU performs addition, subtraction, absolute value, logical
operations (AND, OR, XOR, NOT), conversion between fixed and
floating point formats, and similar functions.
Elementary binary operations are carried out by the barrel shifter, such
as shifting, rotating, extracting and depositing segments, and so on. A
powerful feature of the SHARC family is that the multiplier and the ALU
can be accessed in parallel. In a single clock cycle, data from registers
0-7 can be passed to the multiplier, data from registers 8-15 can be
passed to the ALU, and the two results returned to any of the 16
registers.
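
As an illustrative C sketch of the barrel shifter's elementary binary
operations (the function names are assumptions, not SHARC instruction
mnemonics):

#include <stdint.h>

/* Rotate a 32-bit word left by n positions (assumes 0 < n < 32). */
uint32_t rotate_left(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32u - n));
}

/* Extract a bit field of 'len' bits starting at bit 'pos' (len < 32). */
uint32_t extract_field(uint32_t x, unsigned pos, unsigned len)
{
    return (x >> pos) & ((1u << len) - 1u);
}

/* Deposit 'value' into a bit field of 'len' bits starting at bit 'pos'. */
uint32_t deposit_field(uint32_t x, uint32_t value, unsigned pos, unsigned len)
{
    uint32_t mask = ((1u << len) - 1u) << pos;
    return (x & ~mask) | ((value << pos) & mask);
}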
Shadow registers

Shadow registers are duplicates of all the key registers
that can be switched with their counterparts in a single clock cycle.
They are used for fast context switching, the ability to handle interrupts
quickly. When an interrupt occurs in traditional microprocessors, all the
internal data must be saved before the interrupt can be handled. This
usually involves pushing all of the occupied registers onto the stack,
one at a time. In comparison, an interrupt in the SHARC family is
handled by moving the internal data into the shadow registers in a
single clock cycle. When the interrupt routine is completed, the
registers are just as quickly restored.

Multiplier and Multiplier Accumulator
• Most common operation: array multiplication
• Before the next input sample arrives, the multiplication must be completed
• Multiplication and accumulation are carried out using hardware
elements
Two approaches

A dedicated multiplier-accumulator (MAC) can be implemented in
hardware, which consists of a multiplier and an accumulator in a single
hardware unit, e.g. the Motorola DSP 5600X.
The multiplier and accumulator can be two separate hardware units. The
output of the multiplier is stored in the product register first and then
added to the accumulator, e.g. the Texas Instruments TMS 320C5X.
In both approaches the MAC operation can be completed in one
clock cycle.
Multiplier and Multiplier Accumulator

Input sample register:  xn  xn-1  xn-2  …  xn-M+3  xn-M+2  xn-M+1
Filter coefficients:    h0  h1    h2    …  hM-3    hM-2    hM-1

The output at the nth sample instant is obtained by multiplying the
array xn, corresponding to the present and past M-1 samples of the
input, with the array h, corresponding to the impulse response.
To obtain yn+1, the input signal array xn+1 is multiplied with the array
h.
Multiplier and Multiplier Accumulator
The array xn+1 is obtained by shifting the array xn towards the right so
that the (n+1)th sample of the input data becomes the first element. All
the elements of xn are shifted towards the right by one position, so that
the ith element of xn becomes the (i+1)th element of xn+1.

In programmable DSPs, this shifting is achieved by the instruction
MACD: multiply-accumulate with data shift.
e.g. the TMS 320C5X has the instruction MACD pgm, dma. This
multiplies the contents of the program memory location pgm with the
contents of the data memory location dma and stores the result in the
product register. The contents of the product register are added to the
accumulator before the new product is stored. Further, the contents of
the data memory location are copied to the next location, whose
address is dma + 1.
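
A rough C model of one such step is sketched below; the struct and
variable names are illustrative assumptions and do not reproduce the exact
TMS320C5X programming model:

/* Illustrative model of one MACD step (multiply-accumulate with data move). */
typedef struct {
    long acc;   /* accumulator      */
    long p;     /* product register */
} mac_state;

/* coeff: value fetched from program memory (a filter coefficient)
   data : delay line in data memory; i is the current tap index      */
void macd_step(mac_state *s, int coeff, int *data, int i)
{
    s->acc += s->p;                   /* add previous product to the accumulator */
    s->p    = (long)coeff * data[i];  /* form the new product                    */
    data[i + 1] = data[i];            /* data move: copy the sample to the next location */
}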
Multiple Access Memory
The number of memory accesses per clock period can be increased by
using high-speed memory that permits more than one memory access
per clock period. Multiple-access RAM can be connected to the
programmable DSP (P-DSP) using the Harvard architecture.
e.g. DARAM, dual-access RAM, permits two memory accesses per
clock period.
A DARAM connected to a P-DSP with two independent address and
data buses can provide four memory accesses per clock period.

Multiported Memory
This technique is also used to increase the number of memory accesses
per clock period. A dual-port memory has two independent data and
address buses, so two memory accesses per clock period are possible.
[Diagram: dual-port memory with two independent ports,
Address bus 1 / Data bus 1 and Address bus 2 / Data bus 2]

Multiported memories dispense with the need for two separate memories
for program and data. They permit simultaneous access to both program
and data in a single multiported memory chip.
Drawbacks: larger chip area required; cost increases.
