1 DSP Processor
Languages
Language
Instruction
Execution cycle
Machine Cycle
Subroutine
Interrupts
Conventional Processors and DSP Processors
Computers are extremely capable in two broad areas:
(1) data manipulation, such as word processing and database
management, and
(2) mathematical calculation, used in science, engineering, and Digital
Signal Processing.
All microprocessors can perform both tasks;
however, it is difficult (expensive) to make a device that is optimized for
both. There are technical tradeoffs in the hardware design, such as the
size of the instruction set and how interrupts are handled.
Even more important, there are marketing issues involved:
development and manufacturing cost, competitive position, product
lifetime, and so on. As a broad generalization, these factors have made
traditional microprocessors, such as the Pentium®, primarily directed at
data manipulation. Similarly, DSPs are designed to perform the
mathematical calculations needed in Digital Signal Processing.
(1) Data manipulation. Data manipulation involves storing and sorting
information. Consider a word processing program. The basic task is
to store the information (typed in by the operator), organize the
information (cut and paste, spell checking, page layout, etc.), and
then retrieve the information (such as saving the document on a
floppy disk or printing it with a printer). These tasks are
accomplished by moving data from one location to another
and testing for inequalities.
(2) Mathematical calculation. The execution speed of most DSP
algorithms is limited almost completely by the number of
multiplications and additions required.
Consider the implementation of an FIR digital filter. The input signal is
referred to by x[ ], while the output signal is denoted by y[ ]. Our task
is to calculate the sample at location n in the output signal, i.e., y[n].
An FIR filter performs this calculation by multiplying appropriate
samples from the input signal by a group of coefficients, denoted by
a0, a1, a2, a3, …, and then adding the products. In equation form, y[n]
is found by:
y[n] = a0 x[n] + a1 x[n-1] + a2 x[n-2] + a3 x[n-3] + …
Here the input signal has been convolved with a filter kernel (i.e., an
impulse response) consisting of a0, a1, a2, a3, …. Depending on the
application, there may only be a few coefficients in the filter kernel, or
many thousands. While there is some data transfer and inequality
evaluation in this algorithm, such as to keep track of the intermediate
results and control the loops, the math operations dominate the
execution time.
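The multiply-and-add calculation described above can be sketched in a few lines of Python. This is only an illustration, not an optimized implementation: the function name `fir_output` and the sample data are assumptions made for the example, and samples before the start of the input signal are assumed to be zero.

```python
def fir_output(x, a, n):
    """Return y[n] = a[0]*x[n] + a[1]*x[n-1] + a[2]*x[n-2] + ...

    x is the list of input samples and a is the list of coefficients.
    Samples before the start of x are treated as zero (an assumption).
    """
    acc = 0.0
    for k, coeff in enumerate(a):
        if n - k >= 0:               # inequality test used for loop control
            acc += coeff * x[n - k]  # one multiplication and one addition
    return acc


# Hypothetical data: a four-point moving average applied to a short ramp.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
a = [0.25, 0.25, 0.25, 0.25]
y = [fir_output(x, a, n) for n in range(len(x))]
```

Each output sample costs one multiplication and one addition per coefficient, which is why the math operations dominate when the kernel is long.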
Off-line processing
In off-line processing, the entire input signal resides in the computer
at the same time.
For example, a geologist might use a seismometer to record the
ground movement during an earthquake. After the shaking is over, the
information may be read into a computer and analyzed in some way.
Another example of off-line processing is medical imaging, such as
computed tomography and MRI. The data set is acquired while the
patient is inside the machine, but the image reconstruction may be
delayed until a later time. The key point is that all of the information is
simultaneously available to the processing program. This is common in
scientific research and engineering, but not in consumer products.
Off-line processing is the realm of personal computers and mainframes.
Real-time processing
In real-time processing, the output signal is produced at the same time
that the input signal is being acquired.
For example, this is needed in telephone communication, hearing aids,
and radar. These applications must have the information immediately
available, although it can be delayed by a short amount. For instance, a
10-millisecond delay in a telephone call cannot be detected by the
speaker or listener. Likewise, it makes no difference if a radar signal is
delayed by a few seconds before being displayed to the operator.
Real-time applications input a sample, perform the algorithm, and
output a sample, over-and-over. Alternatively, they may input a group
of samples, perform the algorithm, and output a group of samples. This
is the world of Digital Signal Processors.
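The two real-time patterns described above (one sample in, one sample out, or one group in, one group out) can be sketched with Python generators. The function names and the two-point averaging algorithm are assumptions chosen only to keep the example short; any per-sample algorithm could take their place.

```python
def realtime_samples(samples):
    """Input a sample, perform the algorithm, output a sample, repeatedly."""
    previous = 0.0
    for current in samples:                # input a sample
        out = 0.5 * (current + previous)   # perform the algorithm
        previous = current
        yield out                          # output a sample


def realtime_blocks(samples, block_size):
    """Same algorithm, but samples are input and output in groups.

    A trailing partial group is not output in this simplified sketch.
    """
    previous = 0.0
    block = []
    for current in samples:
        block.append(current)
        if len(block) == block_size:       # a full group has been acquired
            out = []
            for s in block:
                out.append(0.5 * (s + previous))
                previous = s
            yield out                      # output the whole group
            block = []


per_sample = list(realtime_samples([2.0, 4.0, 6.0, 8.0]))
per_block = list(realtime_blocks([2.0, 4.0, 6.0, 8.0], 2))
```

Both versions compute the same values; the block version simply delays each output until its group is complete.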
Circular Buffer
We have placed this circular buffer in eight consecutive memory
locations, 20041 to 20048. Figure (a) shows how the eight samples
from the input might be stored at one particular instant in time, while (b)
shows the changes after the next sample is acquired.
The idea of circular buffering is that the end of this linear array is
connected to its beginning; memory location 20041 is viewed as being
next to 20048, just as 20044 is next to 20045. You keep track of the
array by a pointer (a variable whose value is an address) that
indicates where the most recent sample resides. For instance, in (a)
the pointer contains the address 20044, while in (b) it contains 20045.
When a new sample is acquired, it replaces the oldest sample in the
array, and the pointer is moved one address ahead. Circular buffers
are efficient because only one value needs to be changed when a new
sample is acquired.
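To see why changing a single value matters, the sketch below compares a linear buffer, where every stored sample must be shifted, with a circular buffer. The function names are hypothetical, and list indices 0 to 7 stand in for memory locations 20041 to 20048.

```python
def shift_insert(buffer, sample):
    """Linear buffer: shift every sample down, then store the new one.

    Returns the number of values written.
    """
    writes = 0
    for i in range(len(buffer) - 1):
        buffer[i] = buffer[i + 1]
        writes += 1
    buffer[-1] = sample
    return writes + 1


def circular_insert(buffer, newest, sample):
    """Circular buffer: overwrite the oldest sample and advance the pointer.

    newest is the index of the most recent sample. The oldest sample sits
    in the next location, wrapping from the end of the array to its start.
    Returns the new pointer; exactly one stored value is changed.
    """
    newest = (newest + 1) % len(buffer)
    buffer[newest] = sample
    return newest


linear = [1, 2, 3, 4, 5, 6, 7, 8]
linear_writes = shift_insert(linear, 9)

circular = [0, 1, 2, 3, 4, 5, 6, 7]
pointer = circular_insert(circular, 3, 99)  # index 3 stands for address 20044
```

With eight samples the linear buffer performs eight writes per new sample, while the circular buffer performs one write and one pointer update regardless of its length.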
Four parameters are needed to manage a circular buffer.
1. There must be a pointer that indicates the start of the circular buffer
in memory (in this example, 20041).
2. There must be a pointer indicating the end of the array (e.g.,
20048), or a variable that holds its length (e.g., 8).
3. The step size of the memory addressing must be specified. For example, if the
step size is one, address 20043 contains one sample, address
20044 contains the next sample, and so on.
These three values define the size and configuration of the circular
buffer and will not change during the program operation.
4. The pointer to the most recent sample must be modified as each
new sample is acquired. In other words, there must be program
logic that controls how this fourth value is updated based on the
values of the first three.
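The four parameters and the pointer-update logic can be sketched as follows. The class name, its method names, and the use of a dictionary as a stand-in for memory are all assumptions made for illustration; the addresses are the ones used in the example above.

```python
class CircularBuffer:
    """Circular buffer described by start, length, step and a moving pointer."""

    def __init__(self, memory, start, length, step, newest):
        self.memory = memory    # dict of address -> sample (stand-in for RAM)
        self.start = start      # 1. address where the buffer begins
        self.length = length    # 2. number of locations in the buffer
        self.step = step        # 3. address increment between samples
        self.newest = newest    # 4. address of the most recent sample

    def write(self, sample):
        """Advance the pointer one step, wrapping at the end, and store."""
        offset = (self.newest - self.start + self.step) % self.length
        self.newest = self.start + offset
        self.memory[self.newest] = sample

    def recent(self, count):
        """Return the last `count` samples, newest first."""
        out = []
        addr = self.newest
        for _ in range(count):
            out.append(self.memory[addr])
            offset = (addr - self.start - self.step) % self.length
            addr = self.start + offset
        return out


memory = {20041 + i: 0.0 for i in range(8)}
# The pointer starts at the last location so the first write lands on 20041.
buf = CircularBuffer(memory, start=20041, length=8, step=1, newest=20048)
for sample in range(1, 11):
    buf.write(float(sample))
```

Only `newest` changes while the program runs; `start`, `length` and `step` are fixed when the buffer is set up, matching the split between the first three parameters and the fourth.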
Von Neumann architecture
The Von Neumann design is quite satisfactory when you are content to
execute all of the required tasks in serial. In fact, most computers today
are of the Von Neumann design.
Harvard architecture
The Super Harvard architecture
Instruction cache
The first time through a loop, the program instructions must be passed
over the program memory bus. This results in slower operation
because of the conflict with the coefficients that must also be fetched
along this path.
Simplified diagram of SHARC DSP
IO Controller
Dedicated hardware allows these data streams to be transferred
directly into memory (Direct Memory Access, or DMA), without having to
pass through the CPU's registers.
In other words, obtaining a sample from an IO device and sending a
sample to an IO device happen independently of, and simultaneously with,
the other tasks; no cycles are stolen from the CPU. The main buses
(program memory bus and data memory bus) are also accessible from
outside the chip, providing an additional interface to off-chip memory
and peripherals.
This allows the SHARC DSPs to use a four-gigaword (16-gigabyte)
memory.
Data Address Generator
The SHARC DSP has two Data Address Generators (DAGs), one for each of the two
memories. These control the addresses sent to the program and data
memories, specifying where the information is to be read from or written
to.
In simpler microprocessors this task is handled as an inherent part of the
program sequencer, and is quite transparent to the programmer.
However, DSPs are designed to operate with circular buffers, and benefit
from the extra hardware to manage them efficiently. This avoids needing
to use precious CPU clock cycles to keep track of how the data are
stored. For instance, in the SHARC DSPs, each of the two DAGs can
control eight circular buffers. This means that each DAG holds 32
variables (4 per buffer), plus the required logic.
Data Registers
The data register section of the CPU is used in the same way as in
traditional microprocessors. In the ADSP-2106x SHARC DSPs, there
are 16 general purpose registers of 40 bits each. These can hold
intermediate calculations, prepare data for the math processor, serve
as a buffer for data transfer, hold flags for program control, and so on.
If needed, these registers can also be used to control loops and
counters; however, the SHARC DSPs have extra hardware registers
to carry out many of these functions.
Math Processing
The math processing unit has three sections: a multiplier, an arithmetic
logic unit (ALU), and a barrel shifter.
The multiplier takes the values from two registers, multiplies them, and
places the result into another register.
The ALU performs addition, subtraction, absolute value, logical
operations (AND, OR, XOR, NOT), conversion between fixed and
floating point formats, and similar functions.
Elementary binary operations are carried out by the barrel shifter, such
as shifting, rotating, extracting and depositing segments, and so on. A
powerful feature of the SHARC family is that the multiplier and the ALU
can be accessed in parallel. In a single clock cycle, data from registers
0-7 can be passed to the multiplier, data from registers 8-15 can be
passed to the ALU, and the two results returned to any of the 16
registers.
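The barrel shifter operations named above (rotating, extracting a segment, depositing a segment) can be illustrated in software. This sketch assumes a 32-bit word purely for readability; it is not the SHARC register width, and the function names are hypothetical rather than SHARC instructions.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit word, for illustration only


def rotate_left(value, count):
    """Rotate a 32-bit word left; bits leaving the top re-enter at the bottom."""
    count %= 32
    return ((value << count) | (value >> (32 - count))) & MASK32


def extract_field(value, position, width):
    """Return the `width`-bit segment that starts at bit `position`."""
    return (value >> position) & ((1 << width) - 1)


def deposit_field(value, field, position, width):
    """Write `field` into the `width`-bit segment starting at bit `position`."""
    mask = ((1 << width) - 1) << position
    return (value & ~mask & MASK32) | ((field << position) & mask)


rotated = rotate_left(0x80000001, 1)
segment = extract_field(0xABCD1234, 8, 8)
patched = deposit_field(0xABCD1234, 0xFF, 8, 8)
```

In software each of these takes several shift and mask steps; a hardware barrel shifter exists so that such operations do not cost extra instructions.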
Shadow registers
Shadow registers are duplicates of all the key registers
that can be switched with their counterparts in a single clock cycle.
They are used for fast context switching, that is, the ability to handle interrupts
quickly. When an interrupt occurs in traditional microprocessors, all the
internal data must be saved before the interrupt can be handled. This
usually involves pushing all of the occupied registers onto the stack,
one at a time. In comparison, an interrupt in the SHARC family is
handled by moving the internal data into the shadow registers in a
single clock cycle. When the interrupt routine is completed, the
registers are just as quickly restored.
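The difference between the two interrupt strategies can be modeled in a few lines. This is a conceptual sketch only: the class and function names are hypothetical, and the single-cycle hardware switch is modeled as exchanging two register banks.

```python
class RegisterFile:
    """Sixteen active registers plus a duplicate (shadow) bank."""

    def __init__(self, count=16):
        self.active = [0] * count
        self.shadow = [0] * count

    def swap(self):
        """Exchange the active and shadow banks in one step."""
        self.active, self.shadow = self.shadow, self.active


def save_on_stack(registers, stack):
    """Conventional approach: push the registers one at a time.

    Returns the number of push operations performed.
    """
    for value in registers:
        stack.append(value)
    return len(registers)


regs = RegisterFile()
regs.active[0] = 42   # state of the interrupted program
regs.swap()           # interrupt arrives: switch banks
regs.active[0] = 7    # interrupt routine uses the registers freely
regs.swap()           # interrupt routine completes: switch back
restored = regs.active[0]

pushes = save_on_stack(regs.active, [])
```

The bank switch is a single operation however many registers there are, while the stack approach grows with the number of registers to be saved.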
Multiplier and Multiplier Accumulator
• Most common operation: array multiplication
• The multiplication must be completed before the next input sample arrives
• Multiplication and accumulation are carried out using hardware
elements.
Two approaches
Register
h0 h1 h2 … hM-3 hM-2 hM-1
(n+1)th sample of the input data xn+1 becomes the first element. All the
Multiported Memory
This technique is also used to increase the number of memory accesses per
clock period. A dual-port memory has two independent data and
address buses, so two memory accesses per clock period are possible.
[Figure: dual-port memory with address bus 1 and data bus 1]