Module 2 Notes
Devices
1. Investigate the basic features that should be provided in the DSP architecture to
be used to implement the following Nth order FIR filter
y(n) = Σ h(i) x(n-i),   n = 0, 1, 2, …
In order to implement the above operation in a DSP, the architecture requires the
following features:
2.2.1 Multipliers
The advent of single chip multipliers paved the way for implementing DSP
functions on a VLSI chip. Parallel multipliers, also called array multipliers, have now
replaced the traditional shift and add multipliers. A parallel multiplier takes a single
processor cycle to fetch and execute the instruction and to store the result.
The key features to be considered for a multiplier are:
a. Accuracy
b. Dynamic range
c. Speed
The number of bits used to represent the operands decides the accuracy and the
dynamic range of the multiplier, whereas the speed is decided by the architecture employed.
If the multiplier is implemented in hardware, the speed of execution will be very
high, but the circuit complexity also increases considerably. Thus there is a
tradeoff between the speed of execution and the circuit complexity, and the choice of
architecture normally depends on the application.
The Braun multiplier does not take the sign of the numbers into account. In
order to implement a multiplier for signed numbers, additional hardware is required to
modify the Braun multiplier. The modified multiplier is called the Baugh-Wooley
multiplier.
Consider two signed numbers A and B,
2.2.4 Speed
The conventional shift and add technique of multiplication requires n cycles to
multiply two n bit numbers. In a parallel multiplier, the time required is the longest
path delay in the combinational circuit used. As DSP applications generally require
very high speed, it is desirable to have multipliers operating at the highest possible
speed by using a parallel implementation.
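The n cycle cost of the shift and add technique can be seen in a minimal sketch. The function name shift_add_multiply and the restriction to unsigned operands are assumptions made for illustration only; they are not taken from any particular device.

```python
def shift_add_multiply(a, b, n_bits):
    """Multiply two unsigned n_bits-wide integers, examining one multiplier bit per cycle."""
    product = 0
    cycles = 0
    for bit in range(n_bits):
        if (b >> bit) & 1:
            # Add the multiplicand, shifted to the weight of this multiplier bit.
            product += a << bit
        cycles += 1  # one add/shift step per multiplier bit
    return product, cycles


product, cycles = shift_add_multiply(13, 11, 4)
print(product, cycles)  # 143 4
```

A parallel multiplier forms all of these partial products at once in combinational logic, so the loop above collapses into a single propagation delay.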
2. It is required to find the sum of 64, 16 bit numbers. How many bits should the
accumulator have so that the sum can be computed without the occurrence of
overflow error or loss of accuracy?
The sum of 64 16 bit numbers can grow up to (16 + log2 64) = 22 bits long. Hence
the accumulator should be 22 bits long in order to avoid overflow.
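The bit growth rule can be checked numerically. The sketch below assumes unsigned operands and uses a hypothetical helper accumulator_bits; it compares the rule against the worst case sum.

```python
def accumulator_bits(word_bits, count):
    """Bits needed to sum `count` numbers of `word_bits` bits without overflow."""
    growth = (count - 1).bit_length()  # equals ceil(log2(count)) for count >= 1
    return word_bits + growth


# Worst case for unsigned data: every one of the 64 numbers is all ones.
worst_case_sum = 64 * (2**16 - 1)

print(accumulator_bits(16, 64))     # 22
print(worst_case_sum.bit_length())  # 22
```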
3. In the previous problem, it is decided to have an accumulator with only 16 bits
but shift the numbers before the addition to prevent overflow, by how many bits
should each number be shifted?
As the sum of 64 numbers can grow by log2 64 = 6 bits, each number should be shifted
right by 6 bits before the addition. The 6 least significant bits of each number are
lost in the process.
4. If all the numbers in the previous problem are fixed point integers, what is the
actual sum of the numbers?
The actual sum can be obtained by shifting the result left by 6 bits after the
sum has been computed. Therefore
Actual Sum = Accumulator content × 2^6
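A minimal sketch of this scaling scheme is given here, assuming unsigned 16 bit integers and a 6 bit right shift before each addition. The names scaled_sum and SHIFT are hypothetical. It also shows that the recovered sum is only an estimate, because the bits shifted out are lost.

```python
SHIFT = 6  # log2(64): bit growth when adding 64 numbers


def scaled_sum(samples, shift=SHIFT):
    """Pre-shift each sample right so that the running sum fits in 16 bits."""
    acc = 0
    for x in samples:
        acc += x >> shift
    return acc


samples = [0xFFFF] * 64           # worst case input
acc = scaled_sum(samples)
estimate = acc << SHIFT           # Actual Sum = Accumulator content x 2^6
exact = sum(samples)

print(acc < 2**16)                # True: no overflow in a 16 bit accumulator
print(estimate, exact)            # 4190208 4194240
print(exact - estimate)           # 4032: accuracy lost to the pre-shift
```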
Figure 2.4 depicts the implementation of a 4 bit shift right barrel shifter. Shift to
right by 0, 1, 2 or 3 bit positions can be controlled by setting the control inputs
appropriately.
5. A Barrel Shifter is to be designed with 16 inputs for left shifts from 0 to 15 bits.
How many control lines are required to implement the shifter?
As the shifter has to select one of 16 shift amounts (0 to 15 bits), log2 16 = 4
control lines are required.
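The role of the control lines can be illustrated with a sketch in which each control line enables one stage that shifts by a power of two. This staged (logarithmic) arrangement is an assumption chosen for illustration and is not necessarily the circuit of figure 2.4. The names control_lines and barrel_shift_left are hypothetical.

```python
def control_lines(shift_amounts):
    """Number of control lines needed to select one of `shift_amounts` shifts."""
    return (shift_amounts - 1).bit_length()


def barrel_shift_left(value, control, width=16):
    """Left shift by `control` positions using one stage per control line."""
    mask = (1 << width) - 1
    for stage in range(control_lines(width)):
        if (control >> stage) & 1:
            # Stage k shifts by 2**k when its control line is set.
            value = (value << (1 << stage)) & mask
    return value


print(control_lines(16))                    # 4
print(hex(barrel_shift_left(0x00FF, 4)))    # 0xff0
print(hex(barrel_shift_left(0x0001, 15)))   # 0x8000
```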
Most DSP applications require the computation of the sum of the products
of a series of successive multiplications. In order to implement such functions, a special
unit called a Multiply and Accumulate (MAC) unit is required.
Although addition and multiplication are two different operations, they can be
performed in parallel. While the multiplier is computing a product, the accumulator
can accumulate the product of the previous multiplication. Thus if N products are to be
accumulated, N-1 multiplications can overlap with N-1 additions. During the very first
multiplication the accumulator will be idle, and during the last accumulation the multiplier
will be idle. Thus N+1 clock cycles are required to compute the sum of N products.
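The overlap described above can be sketched as a two stage pipeline with a register between the multiplier and the accumulator. The function pipelined_mac and its variable names are hypothetical and only model the cycle count; they do not describe a specific MAC unit.

```python
def pipelined_mac(h, x):
    """Sum of products h[i]*x[i] with multiply and accumulate overlapped."""
    n = len(h)
    acc = 0
    product_reg = None  # pipeline register between multiplier and accumulator
    cycles = 0
    for cycle in range(n + 1):
        # Accumulator stage: add the product formed in the previous cycle.
        if product_reg is not None:
            acc += product_reg
        # Multiplier stage: form the next product, idle in the last cycle.
        product_reg = h[cycle] * x[cycle] if cycle < n else None
        cycles += 1
    return acc, cycles


print(pipelined_mac([1, 2, 3], [4, 5, 6]))  # (32, 4)
```

With N = 3 products the sketch takes N + 1 = 4 cycles: the accumulator is idle in the first cycle and the multiplier is idle in the last.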
6. If a sum of 256 products is to be computed using a pipelined MAC unit, and if the
MAC execution time of the unit is 100nsec, what will be the total time required to
complete the operation?
As N = 256 products require N+1 = 257 MAC execution times, the total time required
is 257 × 100 nsec = 25.7 µsec.
While designing a MAC unit, attention has to be paid to the word sizes
encountered at the input of the multiplier and to the sizes of the add/subtract unit and
the accumulator, as there is a possibility of overflow and underflow.
Shifters
Shifters can be provided at the input of the MAC to normalize the data and at the
output to denormalize the same.
Guard bits
As the normalization process does not yield an accurate result, it is not desirable for
some applications. In such cases the alternative is to provide additional bits, called
guard bits, in the accumulator so that there will not be any overflow error. Here the
add/subtract unit also has to be modified appropriately to manage the additional bits of
the accumulator.
7. Consider a MAC unit whose inputs are 16 bit numbers. If 256 products are to be
summed up in this MAC, how many guard bits should be provided for the
accumulator to prevent overflow condition from occurring?
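As it is required to calculate the sum of 256 16 bit numbers, the sum can be as
long as (16 + log2 256) = 24 bits. Hence the accumulator should be capable of handling
these 24 bits. Thus the guard bits required will be (24-16) = 8 bits. (The same count of
8 guard bits holds if the accumulator is sized for full 32 bit products, since the growth
from summing 256 terms is still log2 256 = 8 bits.)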
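The guard bit count can be checked against a worst case. The sketch below assumes signed 16 bit inputs and a multiplier that keeps the full 32 bit product, which is an assumption that goes beyond the 16 bit figure used in the answer above. The helpers guard_bits and fits_signed are hypothetical.

```python
def guard_bits(num_terms):
    """Extra accumulator bits needed to sum `num_terms` values without overflow."""
    return (num_terms - 1).bit_length()  # equals ceil(log2(num_terms))


def fits_signed(value, bits):
    """True if `value` is representable as a two's complement number of `bits` bits."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


PRODUCT_BITS = 32                        # assumed width of a 16 x 16 bit product
worst_product = (-32768) * (-32768)      # largest magnitude product, 2**30
worst_sum = 256 * worst_product          # 2**38

print(guard_bits(256))                                          # 8
print(fits_signed(worst_sum, PRODUCT_BITS))                     # False
print(fits_signed(worst_sum, PRODUCT_BITS + guard_bits(256)))   # True
```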
The block diagram of the modified MAC with guard (extension) bits is shown in
figure 2.6.
Fig 2.6 MAC Unit with Guard Bits
8. What should be the minimum width of the accumulator in a DSP device that
receives 10 bit A/D samples and is required to add 64 of them without causing an
overflow?
As it is required to calculate the sum of 64 10 bit numbers, the sum can be as
long as (10 + log2 64) = 16 bits. Hence the accumulator should be at least 16 bits wide,
that is, (16-10) = 6 guard bits beyond the 10 bit sample width.
Saturation Logic
Overflow/underflow will occur if the result goes beyond the most positive
number or below the most negative number the accumulator can handle. The
overflow/underflow error can be limited by loading the accumulator with the most
positive number it can handle at the time of overflow and with the most negative
number it can handle at the time of underflow. This method is called saturation
logic. A schematic diagram of saturation logic is shown in figure 2.7.
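The effect of saturation can be compared with ordinary two's complement wraparound in a short sketch. A 16 bit accumulator is assumed, and the function names saturating_add and wrapping_add are hypothetical.

```python
def saturating_add(acc, value, bits=16):
    """Add and clamp the result to the signed range of a `bits`-wide accumulator."""
    max_pos = (1 << (bits - 1)) - 1
    min_neg = -(1 << (bits - 1))
    total = acc + value
    if total > max_pos:
        return max_pos   # overflow: load the most positive number
    if total < min_neg:
        return min_neg   # underflow: load the most negative number
    return total


def wrapping_add(acc, value, bits=16):
    """Add with two's complement wraparound, as happens without saturation logic."""
    half = 1 << (bits - 1)
    return ((acc + value + half) % (1 << bits)) - half


print(saturating_add(32000, 1000))    # 32767
print(wrapping_add(32000, 1000))      # -32536
print(saturating_add(-32000, -1000))  # -32768
```

Without saturation the overflowed result changes sign, which is usually a much larger error than clamping at the limit.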
Status Flags
The ALU includes circuitry to generate status flags after arithmetic and logic
operations. These flags include sign, zero, carry and overflow.
Overflow Management
Depending on the status of overflow and sign flags, the saturation logic can be
used to limit the accumulator content.
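How the flags might drive the saturation logic is sketched below for a 16 bit two's complement addition. The flag definitions follow the usual conventions, while the function names add_with_flags and limit and the dictionary layout are assumptions made for illustration.

```python
def add_with_flags(a, b, bits=16):
    """Add two `bits`-wide bit patterns and return (result pattern, status flags)."""
    mask = (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    a &= mask
    b &= mask
    raw = a + b
    result = raw & mask
    flags = {
        "carry": raw > mask,
        "zero": result == 0,
        "sign": bool(result & sign_bit),
        # Overflow: operands have the same sign but the result sign differs.
        "overflow": bool(~(a ^ b) & (a ^ result) & sign_bit),
    }
    return result, flags


def limit(result, flags, bits=16):
    """Use the overflow and sign flags to saturate the accumulator content."""
    if not flags["overflow"]:
        return result
    if flags["sign"]:
        # Result looks negative after overflow, so the true sum was too positive.
        return (1 << (bits - 1)) - 1
    # Result looks positive after overflow, so the true sum was too negative.
    return 1 << (bits - 1)


result, flags = add_with_flags(0x7FFF, 0x0001)
print(hex(result), flags["overflow"], flags["sign"])  # 0x8000 True True
print(hex(limit(result, flags)))                      # 0x7fff
```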
Register File
Instead of moving data in and out of the memory during the operation, a large
set of general purpose registers is provided to store the intermediate results, for
better speed.
2.5 Bus Architecture and Memory
In order to increase the speed of operation, separate memories are used to store
program and data, and each memory is given its own set of data and address buses.
This arrangement is called the Harvard architecture. It is shown in figure 2.10.
Although the use of separate memories for data and instructions speeds up
the processing, it does not completely solve the problem. As many DSP instructions
require more than one operand, the use of a single data memory forces the operands
to be fetched one after the other, thus increasing the processing delay. This problem can
be overcome by using two separate data memories for storing the operands, so that
both operands can be fetched together in a single clock cycle (Figure 2.11).
Although the above architecture improves the speed of operation, it requires more
hardware and interconnections, thus increasing the cost and complexity of the system.
Therefore there is a trade off between cost and speed when selecting the memory
architecture for a DSP.
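The speed side of this trade off can be estimated with a simple count of memory accesses. The sketch assumes that each memory with its own buses serves one access per cycle, that independent memories work in parallel, and that the instruction needs one instruction fetch and two operand fetches. The single shared memory baseline and the function name cycles_per_instruction are assumptions added for comparison.

```python
def cycles_per_instruction(accesses_per_memory):
    """Cycles needed when each memory serves one access per cycle in parallel."""
    return max(accesses_per_memory.values())


# One instruction fetch plus two operand fetches, spread over the memories.
single_shared_memory = {"shared": 3}
harvard = {"program": 1, "data": 2}
harvard_dual_data = {"program": 1, "data_a": 1, "data_b": 1}

print(cycles_per_instruction(single_shared_memory))  # 3
print(cycles_per_instruction(harvard))               # 2
print(cycles_per_instruction(harvard_dual_data))     # 1
```

Each added memory removes one cycle here, but it also adds a set of address and data buses, which is the cost mentioned above.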
Speed
On-chip memories should match the speeds of the ALU operations in order to
maintain the single cycle instruction execution of the DSP.
Size
In a given area of the DSP chip, it is desirable to implement as many DSP
functions as possible. Thus the area occupied by the on-chip memory should be minimal
so that there is scope for implementing more DSP functions on-chip.
Ideally, the whole memory required for the implementation of any DSP algorithm
should reside on-chip so that the whole processing can be completed in a single execution
cycle. Although this looks like a better solution, it consumes more space on chip, reducing
the scope for implementing other functional blocks on-chip, which in turn reduces the speed
of execution. Hence other alternatives have to be considered. The following are
some other ways in which the on-chip memory can be organized.
In this mode, one of the registers will be holding the data and the register has to
be specified in the instruction.
In this addressing mode, instruction holds the memory location of the operand.
10. Identify the addressing modes of the operands in each of the following
instructions
a. ADD #1234h
b. ADD 1234h
c. ADD *AR+
d. ADD offsetreg-,*AR
For the implementation of some real time applications in DSP, normal addressing
modes will not completely serve the purpose. Thus some special addressing modes are
required for such applications.
2.7.1 Circular Addressing Mode
There are four special cases in this addressing mode. They are
a. SAR < EAR & updated PNTR > EAR
b. SAR < EAR & updated PNTR < SAR
c. SAR >EAR & updated PNTR > SAR
d. SAR > EAR & updated PNTR < EAR
The buffer length in the first two cases will be (EAR-SAR+1), whereas for the next
two cases it will be (SAR-EAR+1).
The pointer updating algorithm for the circular addressing mode is as shown
below.
The four cases explained earlier are shown in figure 2.12.
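A minimal sketch of pointer updating that covers the four cases is given here. It is a reconstruction from the case list and buffer lengths above, not a copy of the original algorithm, and it assumes the step is no larger than the buffer length. The function name update_pointer and the addresses 0200h and 020Fh in the first two examples are hypothetical.

```python
def update_pointer(pntr, step, sar, ear):
    """Advance PNTR by `step` and wrap it back into the buffer bounded by SAR and EAR."""
    low, high = min(sar, ear), max(sar, ear)
    length = high - low + 1          # (EAR-SAR+1) or (SAR-EAR+1)
    new_pntr = pntr + step
    if new_pntr > high:              # cases a and c: ran past the upper bound
        new_pntr -= length
    elif new_pntr < low:             # cases b and d: ran below the lower bound
        new_pntr += length
    return new_pntr


# SAR < EAR
print(hex(update_pointer(0x020D, 4, 0x0200, 0x020F)))   # 0x201
print(hex(update_pointer(0x0201, -3, 0x0200, 0x020F)))  # 0x20e
# SAR > EAR
print(hex(update_pointer(0x020D, 4, 0x0210, 0x0201)))   # 0x201
print(hex(update_pointer(0x0202, -3, 0x0210, 0x0201)))  # 0x20f
```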
12. Repeat the previous problem for SAR= 0210h and EAR=0201h
It works as follows. Start with index 0. The present index can be calculated by
adding half the FFT length to the previous index in a bit reversed manner, carry being
propagated from MSB to LSB.
13. Compute the indices for an 8-point FFT using Bit reversed Addressing Mode
The process continues till all the indices are calculated. The following table summarizes
the calculation.
Index in binary   Decimal value   Bit reversed index   Decimal value
000               0               000                  0
001               1               100                  4
010               2               010                  2
011               3               110                  6
100               4               001                  1
101               5               101                  5
110               6               011                  3
111               7               111                  7
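The index sequence in the table can be generated by adding half the FFT length with the carry propagating from MSB to LSB. The sketch below is one way to write that reverse carry addition; the function names next_bit_reversed and bit_reversed_indices are hypothetical.

```python
def next_bit_reversed(index, n_points):
    """Add n_points/2 to `index` with the carry propagating from MSB to LSB."""
    bit = n_points >> 1
    while bit and (index & bit):
        index ^= bit   # this bit was 1: clear it and carry to the next lower bit
        bit >>= 1
    return index | bit  # set the first bit that was 0 (no bit left means wrap to 0)


def bit_reversed_indices(n_points):
    """All indices of an n_points FFT in bit reversed order, starting at 0."""
    indices = [0]
    for _ in range(n_points - 1):
        indices.append(next_bit_reversed(indices[-1], n_points))
    return indices


print(bit_reversed_indices(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```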
The main job of the Address Generation Unit is to generate the addresses of the
operands required to carry out the operation. It has to work fast in order to satisfy
the timing constraints.
The block diagram of a typical address generation unit is shown in figure 2.13.
Fig 2.13 Address Generation Unit
a. Program Counter
b. Instruction register in case of branching, looping and subroutine calls
c. Interrupt Vector table
d. Stack which holds the return address
Hardware architecture
Functions such as multiplication, scaling, loops and repeats, and special addressing modes are essential
for signal processing algorithms. The architectures designed for signal processing applications
should implement these functions in the quickest possible time. This is achieved by hardware
units which are specially designed to implement these functions. To increase the speed of the operations
considerably, parallel multipliers are used to carry out the entire multiplication in a single clock
cycle.
Harvard architecture, which separates program and data memories with separate buses for
each, increases the speed of execution of programs considerably. Dual data memories with individual
buses for each help in accessing dual operands simultaneously.
Multiple external memories require multiple buses external to the DSP. In addition to being
expensive, external buses are slow for program access and execution. By providing on chip memories
and an instruction cache, program execution is sped up considerably. Further, these on chip memories
can also be accessed twice in a clock cycle, thereby reducing the number of separate memories and
buses required in a device.
In addition to the hardware issues, there are many techniques used in DSP architectures to increase
their speed of operation, such as parallelism and pipelining.
Parallelism
A major requirement for achieving high speed of operation in a DSP architecture is the provision of
parallelism. Parallelism is the provision of functional units which can operate in parallel and increase
the throughput. For example, a separate address arithmetic unit can be provided to take care of address
computations. This frees up the main arithmetic unit to concentrate on data computations alone and
thereby increases the throughput. Another example is the provision of multiple memories and multiple
buses to fetch an instruction and operands simultaneously.
Availability of multiple functional units can increase the speed of DSP architectures. With regard to
the multiply and accumulate operation, which is the most used operation in DSP implementations, an
architecture with ideal parallelism should be able to accomplish the following operations in
a single clock cycle.
• Fetch instructions and multiple data required for the computation
• Shift data as they are fetched in order to accomplish scaling
• Carry out a multiplication operation on the fetched data
• Add the product to the previously computed result in the accumulator
• Save the accumulator contents in the memory storage, if required, and
• Compute new addresses for the instruction and data required for the next operation
Pipelining
An architectural feature to increase the speed of the DSP algorithm is pipelining. In a pipelined
architecture, an instruction to be executed is broken into a number of steps. A separate unit of the
architecture performs each of these steps. When the first of these units performs the first step on the
current instruction, the second unit will be performing the second step on the previous instruction, the
third unit will be performing the third step on the instruction prior to that, and so on. If p steps are
required to complete the execution of each instruction, it takes p units of time for the complete
execution of each instruction. However, since all the units work all the time, one output flows out of the
architecture at the end of each time unit, and the throughput can be maintained as one instruction per
unit time. For example, assume the execution of an instruction can be broken into five steps: instruction
fetch, instruction decode, operand fetch, execute and save the result.
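The five step example can be laid out as a schedule that shows which instruction occupies each unit in each time unit. The sketch assumes an ideal pipeline with no stalls; the names STAGES and pipeline_schedule are hypothetical.

```python
STAGES = ["fetch", "decode", "operand fetch", "execute", "save"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return, for each time unit, a mapping of stage name to instruction number."""
    total_time = len(stages) + num_instructions - 1
    schedule = []
    for t in range(total_time):
        row = {}
        for s, name in enumerate(stages):
            instr = t - s  # instruction that entered the pipeline s time units ago
            if 0 <= instr < num_instructions:
                row[name] = instr
        schedule.append(row)
    return schedule


schedule = pipeline_schedule(8)
print(len(schedule))  # 12 time units for 8 instructions, instead of 8 * 5 = 40
print(schedule[4])    # {'fetch': 4, 'decode': 3, 'operand fetch': 2, 'execute': 1, 'save': 0}
```

Once the pipeline is full (from time unit 4 onward in this example), one instruction completes in every time unit, even though each instruction still takes five time units from fetch to save.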
Performance summary
Peripherals include interfaces for interrupts, direct memory access, serial I/O and parallel
I/O, as well as timers and serial and parallel signal converters.