ACA20012021 - Vector & Multiple Issue Processor - 2
[Figure: chained ADD-MPY pipeline; the VMPY unit computes VR3.1 * VR4.1]
Vector Processor Implementation
• The illustrated chained ADD-MPY, with each
functional unit having 4 stages, saves 64
cycles.
• If unchained, each function would have taken
4 (startup) + 64 (elements/VR) = 68 cycles,
a total of 136 cycles.
• With chaining this is reduced to 4 (add
startup) + 4 (multiply startup) + 64
(elements/VR) = 72 cycles, a saving of 64 cycles.
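The cycle counts above can be checked with a small model. This is a sketch, assuming each functional unit has a 4-cycle startup and each vector register holds 64 elements, as in the slides; the function names are illustrative.

```python
# Cycle-count model for chained vs. unchained vector ADD followed by MPY.
# Assumptions (from the slides): 4-stage functional units (4-cycle startup)
# and 64 elements per vector register.

STARTUP = 4   # pipeline depth of each functional unit
VLEN = 64     # elements per vector register

def unchained_cycles(startup: int, n: int, units: int = 2) -> int:
    """Each operation drains completely before the next one starts."""
    return units * (startup + n)

def chained_cycles(startup: int, n: int, units: int = 2) -> int:
    """Results are forwarded element by element; only the startups add up."""
    return units * startup + n

print(unchained_cycles(STARTUP, VLEN))   # 136
print(chained_cycles(STARTUP, VLEN))     # 72
print(unchained_cycles(STARTUP, VLEN) - chained_cycles(STARTUP, VLEN))  # 64
```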
Vector Memory
• The simple low-order interleaving used in
normal pipelined processors is not suitable
for vector processors.
• Vector accesses are non-sequential but
systematic: if the array dimension or stride
(the address distance between adjacent
elements) is a multiple of the interleaving
factor, all references concentrate on the
same module.
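The module-conflict problem can be demonstrated directly. This is a minimal sketch: the module counts and strides are illustrative, with 2^k + 1 = 17 anticipating the prime-modulus design discussed next.

```python
# Why low-order interleaving fails for strided vector access: when the
# stride is a multiple of the number of modules, every reference lands
# in the same module.

def modules_touched(n_modules: int, stride: int, n_refs: int = 64) -> set:
    """Set of module indices hit by n_refs accesses at the given stride."""
    return {(i * stride) % n_modules for i in range(n_refs)}

# Unit stride spreads 64 references over all 8 modules.
print(len(modules_touched(8, 1)))    # 8
# Stride 8 (equal to the interleaving factor) hits a single module.
print(len(modules_touched(8, 8)))    # 1
# A prime module count (17 = 2^4 + 1) disperses even stride-8 accesses.
print(len(modules_touched(17, 8)))   # 17
```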
Vector Memory
• It is quite common for these strides to be of
the form 2^k or other even dimensions.
• So vector memory designs remap addresses
and use a prime number of memory modules.
• Hashed addressing is a technique for
dispersing addresses.
• Hashing is a strict 1:1 mapping of the bits in
X to form a new address X' based on simple
manipulations of the bits in X.
Vector Memory
• A memory system used in vector / matrix
accessing consists of the following units:
– Address hasher
– 2^k + 1 memory modules
– Module mapper
This may add some overhead and extra cycles
to memory access, but since the purpose of
the memory is to access vectors, the cost can
be overlapped in most cases.
[Figure: Vector memory address mapping. The address X passes through the
Address Hasher to produce X'; the module address computation then yields
the module index X' mod (2^k + 1) and the address within a module X' / 2^k.]
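The mapping can be sketched in a few lines. Assumptions: k = 4 (so 2^4 + 1 = 17 modules), and `hash_addr` is an illustrative 1:1 bit rotation standing in for the address hasher; the real hasher could use any invertible bit manipulation.

```python
# Sketch of the vector-memory address mapping: hash X to X', then split
# X' into a module index (mod the prime 2^k + 1) and an in-module offset.

K = 4
MODULES = 2**K + 1   # prime number of memory modules (17)

def hash_addr(x: int, width: int = 16) -> int:
    """Strict 1:1 remapping: rotate the address bits (invertible)."""
    r = 5  # arbitrary rotation amount, for illustration only
    mask = (1 << width) - 1
    return ((x << r) | (x >> (width - r))) & mask

def map_address(x: int):
    xp = hash_addr(x)
    module = xp % MODULES    # module index: X' mod (2^k + 1)
    offset = xp >> K         # address within the module: X' / 2^k
    return module, offset

module, offset = map_address(0x1234)
print(module, offset)
```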
Superscalar Machines
• The maximum program speedup available in
multiple-issue machines depends largely on
sophisticated compiler technology.
• The potential speedup available from the
Multiflow compiler using trace scheduling is
generally less than 3.
• Recent multiple-issue machines with more
modest objectives are called Superscalar
Machines.
• The ability to issue multiple instructions in a
single cycle is referred to as a Superscalar
implementation.
Major Data Paths in a Generic
M.I.Machine
(Refer to fig 7.28 on page no 458.)
• Simultaneous access to data as required
by a VLIW processor mandates extensive
use of register ports, which can become a
bottleneck.
• Dynamic multiple-issue processors use
multiple buses connecting the register sets
and functional units, with each bus serving
multiple functional units.
• This may limit the maximum degree of
concurrency.
Cost Comparison
• These registers occupy a significant amount
of area. If p is the number of ports, the area
required is
Area = (no. of regs + 3p)(bits per reg + 3p) rbe.
• Most vector processors have 8 sets of 64-
element registers, with each element being
64 bits in size.
• Each vector register is dual-ported (a read
port and a write port). Since registers are
accessed sequentially, each port can be
shared by all elements in the register set.
Cost Comparison
• There is an additional switching overhead
to switch each of the n vector registers to
each of the p external ports:
Switch area = 2 x (bits per element) x p x (no. of vector registers) rbe.
• So the area used by the register sets in a
vector processor (supporting 8 ports) is
Area = 8 x [(64+6)(64+6)] = 39,200 rbe.
Switch area = 2 x 64 x 8 x 8 = 8,192 rbe.
[Note: The register bit equivalent (rbe) is a useful unit defined to be a 6-transistor register (memory)
cell and represents about 2700 λ². An even larger unit, A, is defined as 1 mm² of die area at f = 1 μm.
This is also the area occupied by a 32×32-bit three-ported register file, or 1481 rbe.]
Cost Comparison
• A multiple-issue processor with 32 registers,
each 64 bits wide and supporting 8 ports,
will require
Area = (32 + 3(8))(64 + 3(8)) = 4,928 rbe.
• So vector processors use almost 42,464 rbe
of extra area compared to M.I. processors.
• This extra area corresponds to about 70,800
cache bits (at 0.6 rbe/bit), i.e. approximately
8 KB of data cache.
• Vector processors use a small data cache.
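The rbe arithmetic on these slides can be reproduced in a few lines. This is a sketch of the model as stated above: (r + 3p)(b + 3p) rbe per register file, a dual-ported (p = 2) 64 x 64-bit set replicated 8 times for the vector machine, plus the switch term, against a 32 x 64-bit, 8-port file for the multiple-issue machine.

```python
# Register-file area model from the cost-comparison slides, in rbe.

def reg_area(n_regs: int, bits: int, ports: int) -> int:
    """Area = (no. of regs + 3p)(bits per reg + 3p) rbe."""
    return (n_regs + 3 * ports) * (bits + 3 * ports)

# Vector processor: 8 sets of 64 elements x 64 bits, dual-ported (p = 2).
vec_regs = 8 * reg_area(64, 64, 2)   # 8 x 70 x 70 = 39,200 rbe
# Switch: 2 x (bits per element) x (8 ports) x (8 vector registers).
vec_switch = 2 * 64 * 8 * 8          # 8,192 rbe
# Multiple-issue processor: 32 x 64-bit registers with 8 ports.
mi_regs = reg_area(32, 64, 8)        # 56 x 88 = 4,928 rbe

extra = vec_regs + vec_switch - mi_regs   # 42,464 rbe of extra area
cache_bits = extra / 0.6                  # ~70,773 bits, roughly 8 KB
print(vec_regs, vec_switch, mi_regs, extra)
```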
Cost Comparison
• Multiple-issue machines require a larger data
cache to ensure high performance.
• Vector processors require support hardware
for managing access to the memory system.
• Also, a high degree of interleaving is required
in the memory system to support the
processor bandwidth.
• M.I. machines must support 4-6 reads and
2-3 writes per cycle. This increases the area
required by the buses between the arithmetic
units and registers.
Cost Comparison
• M.I. machines must fetch and hold multiple
instructions each cycle from the I-cache.
• This increases the size of the I-fetch path
between the I-cache and the instruction
decoder / instruction register.
• At the instruction decoder, multiple instructions
must be decoded simultaneously and instruction
independence must be detected.
• The net cost difference depends heavily on the
size of the data cache required by M.I. machines
and the cost of the memory interleaving required
by the vector processor.
Performance Comparison
• The performance of vector processors
depends primarily on two factors:
– The percentage of code that is vectorizable.
– The average length of vectors.
We know that n_1/2, the vector size at which the
vector processor achieves approximately half
its asymptotic performance, is roughly the
same as the depth of the arithmetic plus
memory access pipeline.
For short vectors the data cache is sufficient in
M.I. machines, so for short vectors M.I.
processors would perform better than an
equivalent vector processor.
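The crossover can be illustrated with a toy timing model. All parameters here are illustrative assumptions, not measurements: the vector unit delivers 1 element/cycle after a startup s equal to the arithmetic-plus-memory pipeline depth (so n_1/2 = s), and the M.I. machine sustains 2 elements/cycle while data fits in its cache but pays a per-element memory penalty beyond it.

```python
# Toy model of vector vs. multiple-issue performance against vector length.
# Every constant below is an assumption chosen only to show the crossover.

def vector_time(n: int, startup: int = 20) -> float:
    """Cycles for an n-element vector op: startup, then 1 element/cycle."""
    return startup + n

def mi_time(n: int, rate: float = 2.0, cache_elems: int = 256,
            miss_cycles: float = 2.0) -> float:
    """In-cache elements stream at `rate`; past the data cache each
    element pays a memory penalty."""
    hits = min(n, cache_elems)
    misses = max(0, n - cache_elems)
    return hits / rate + misses * miss_cycles

def n_half(startup: int = 20) -> int:
    # Half the asymptotic rate: n / (startup + n) = 1/2  =>  n = startup.
    return startup

print(n_half())                          # 20
print(mi_time(8) < vector_time(8))       # True: short vectors favour M.I.
print(mi_time(512) < vector_time(512))   # False: long vectors favour vector
```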
Performance Comparison
• As vectors get longer, the performance of the
M.I. machine becomes much more dependent
on the size of the data cache, while the n_1/2
behaviour of the vector processor improves.
• So for long vectors, performance would be
better in the case of the vector processor.
• The actual difference depends largely on the
sophistication of the compiler technology.
• The compiler can recognize the occurrence of
a short vector and treat that portion of code
as if it were scalar code.