ACA20012021 - Vector & Multiple Issue Processor - 2


Vector Processor Implementation

• Vector processor implementation requires a
considerable amount of additional control
logic and hardware.
• Vector registers used to store vector operands
generally bypass the data cache.
• Data cache is used solely to store scalar
values.
• Since a particular value stored in a vector may
be aliased as a scalar and stored in the data
cache, all vector references to memory must be
checked against the contents of the data cache.

Vector Processor Implementation


• If there is a hit, invalidate the current value
contained in the data cache and force a memory
update.
• Additional H/W or S/W control is required to
ensure that scalar references from the data cache
to memory do not inadvertently reference a
value contained in a vector register.
• Earlier vector processors used a memory-to-memory
instruction format, but due to severe
memory congestion and contention problems
most recent vector processors use vector
registers to load/store vector operands.
Vector Processor Implementation
• Vector registers generally consist of a set of
eight registers, each containing from 16 to 64
entries the size of a floating-point word.
• The arithmetic pipeline may be shared with the
scalar part of the processor.
• Under some conditions, it is possible to
execute more than one arithmetic operation
per cycle.
• The result of one arithmetic operation can be
used directly as an operand in a subsequent
vector instruction.

Vector Processor Implementation


• This is called Chaining.
• For the two instructions
    VADD VR3, VR1, VR2
    VMPY VR5, VR3, VR4
[Figure: chained VADD–VMPY — as each element VR3.i leaves the adder it is
fed directly to the multiplier together with VR4.i, without waiting for the
full VADD to complete.]
Vector Processor Implementation
• The illustrated chained ADD–MPY, with each
functional unit having 4 stages, saves 64
cycles.
• If unchained, each function would take 4 (startup) +
64 (elements/VR) = 68 cycles – a total of 136 cycles.
• With chaining this is reduced to 4 (add startup)
+ 4 (multiply startup) + 64 (elements/VR)
= 72 cycles – a saving of 64 cycles.
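
A minimal sketch of this cycle arithmetic (assuming, as above, 4-stage
functional units and 64-element vector registers):

```python
# Cycle counts for a sequence of pipelined vector operations.
def vector_cycles(startups, elements=64):
    # Unchained: each op pays its own startup, then one cycle per element.
    unchained = sum(s + elements for s in startups)
    # Chained: startups are incurred back to back, elements stream through once.
    chained = sum(startups) + elements
    return unchained, chained

unchained, chained = vector_cycles(startups=[4, 4])  # VADD then VMPY
print(unchained, chained, unchained - chained)       # 136 72 64
```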

Vector Processor Implementation


• Another important aspect of vector
implementation is management of references to
memory.
• Memory must have sufficient bandwidth to
support a minimum of two, and preferably three,
references per cycle (two reads and one write).
• This bandwidth allows for two vector reads and
one vector write to be initiated and executed
concurrently with execution of a vector
arithmetic operation.
Vector Processor Implementation
• Major data paths in a generic vector processor
are shown in Figure 7.14 on page 438 of
Computer Architecture by Michael Flynn.

Vector Memory
• Simple low-order interleaving, as used in normal
pipelined processors, is not suitable for vector
processors.
• Vector accesses are non-sequential but
systematic: if the array dimension or stride (the
address distance between adjacent elements)
equals, or is a multiple of, the interleaving
factor, then all references concentrate on the
same module, as the sketch below illustrates.
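
A quick illustration with hypothetical numbers — 8 low-order-interleaved
modules and a stride equal to the interleaving factor:

```python
# With low-order interleaving, module = address mod 8, so a stride-8
# sweep over a 64-element vector hits only one module.
modules = 8
addresses = range(0, 64 * 8, 8)             # 64 elements, stride 8
print({a % modules for a in addresses})     # {0}: every reference collides
```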
Vector Memory
• It is quite common for these strides to be of
the form 2^k or other even dimensions.
• So vector memory designs use address
remapping and a prime number of
memory modules.
• Hashed addressing is a technique for
dispersing addresses.
• Hashing is a strict 1:1 mapping of the bits in
X to form a new address X' based on simple
manipulations of the bits in X.

Vector Memory
• A memory system used in vector/matrix
accessing consists of the following units:
– Address hasher
– 2^k + 1 memory modules
– Module mapper
This may add a certain overhead and extra
cycles to a memory access, but since the purpose
of the memory is to access vectors, this can be
overlapped in most cases.
[Figure: Vector memory — an address X passes through the address hasher to
give X'; the module index is computed as X' mod (2^k + 1) and the address
within a module as X' / 2^k, across the 2^k + 1 modules.]
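
A sketch of this mapping under the assumptions above — k = 4, giving 17
modules (the arrangement used in machines such as the Burroughs BSP). The
hash here is an identity placeholder for a real bit-manipulation hash:

```python
K = 4
MODULES = 2**K + 1                    # a prime number of modules (17)

def hash_address(x):
    return x                          # stand-in: real hashers permute bits of x

def map_address(x):
    xp = hash_address(x)
    module = xp % MODULES             # which module services the reference
    offset = xp // 2**K               # address within that module
    return module, offset

# A power-of-two stride (16) now touches all 17 modules instead of one:
print(sorted({map_address(a)[0] for a in range(0, 17 * 16, 16)}))
```

Dividing by 2^k rather than by 2^k + 1 keeps the in-module address a cheap
shift; since addresses within any block of 16 fall in distinct modules, the
mapping remains one-to-one.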

Modeling Vector Memory Performance
• Vector memory is designed for multiple
simultaneous requests to memory.
• Operand fetching and storing is overlapped
with vector execution.
• Three concurrent operand accesses to memory
are a common target, but the increased cost of
the memory system may limit this to two.
• Chaining may require even more accesses.
• Another issue is the degree of bypassing, or
out-of-order requests, that a source can make
to the memory system.
Modeling Vector Memory Performance
• In case of a conflict, i.e. a request directed
to a busy module, the source can continue to
make subsequent requests only if the unserviced
requests are held in a buffer.
• Assume each of the s access ports to memory
has a buffer of size TBF/s which holds
requests delayed by a conflict.
• For each source, degree of bypassing is
defined as the allowable number of requests
waiting before stalling of subsequent
requests occurs.

Modeling Vector Memory Performance
• If Qc is the expected number of denied
requests per module and m is the number of
modules, then the buffer must be large
enough to hold the denied requests:
    Buffer: TBF > m · Qc
• If n is the total number of requests made and
B is the bandwidth achieved, then
    m · Qc = n − B   (denied requests)
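
A tiny numeric check of this relation, using illustrative values for m, n,
and B:

```python
# Denied requests per Tc are n - B; spread over m modules that is Qc each,
# so the total buffering must exceed m * Qc = n - B.
m, n, B = 17, 8, 7.2                  # assumed values, for illustration only
Qc = (n - B) / m
print(round(Qc, 3), m * Qc)           # per-module queue, minimum total buffer
```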
Gamma (γ) – Binomial Model
• Assume that each vector source issues a
request each cycle, and that each
physical requestor has the same buffer
capacity and characteristics.
• If the vector processor can make s
requests per cycle and there are t cycles
per Tc, then
    Total requests per Tc = t · s = n
This is the same as n requests per Tc in the simple
binomial model.

Gamma (γ) – Binomial Model
• If γ is the mean queue size of bypassed
requests awaiting service, then each of the γ
buffered requests also makes a request.
• From a memory modeling point of view this
is equivalent to the buffer requesting service
each cycle until the module is free.
    Total requests per Tc = t · s + t · s · γ
                          = t · s (1 + γ)
                          = n (1 + γ)
Gamma (γ) – Binomial Model
• Substituting this value in the simple binomial
equation:
    B(m, n, γ) = m + n(1+γ) − 1/2 − √[ (m + n(1+γ) − 1/2)² − 2nm(1+γ) ]
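
The reconstructed equation, sketched in Python with m, n, and γ as defined
above (the sample arguments are illustrative):

```python
import math

def bandwidth(m, n, gamma):
    # Offered traffic grows because bypassed requests re-request each cycle.
    n_eff = n * (1 + gamma)
    a = m + n_eff - 0.5
    return a - math.sqrt(a * a - 2 * n_eff * m)

print(round(bandwidth(17, 8, 0.0), 3))   # no bypass queue
print(round(bandwidth(17, 8, 0.4), 3))   # bypassed requests add traffic
```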

Calculating γopt

• γ is the mean expected bypassed-request
queue per source.
• If we continue to increase the number of bypass
buffer registers we can achieve a γopt which totally
eliminates contention.
• No contention occurs when B = n, or B(m, n, γ) = n.
• This occurs when a = B/m = n/m.
• Since the MB/D/1 queue size is given by
    Q = (a² − pa) / 2(1 − a) = (n(1+γ) − B) / m
Calculating γopt
• Substituting a = n/m and p = 1/m we get:
    Q = (n² − n) / 2(m² − nm) = (n/m)(n − 1) / (2m − 2n)
• Since Q = (n(1+γ) − B) / m,
    mQ = n(1+γ) − B
Now for γopt, (n − B) = 0, so
    γopt = (m/n) Q = (n − 1) / (2m − 2n)
and the mean total buffer size is TBF = n · γopt.
To avoid overflow the buffer may be made considerably
larger, perhaps 2 × TBF.
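
A short sketch of the resulting sizing rule, with assumed values m = 17,
n = 8:

```python
# gamma_opt eliminates contention (requires n < m, i.e. utilization below 1).
def gamma_opt(m, n):
    return (n - 1) / (2 * m - 2 * n)

m, n = 17, 8
g = gamma_opt(m, n)
TBF = n * g                            # mean total buffer size
print(round(g, 3), round(TBF, 2), round(2 * TBF, 2))  # 2 x TBF avoids overflow
```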

Vector Processor Speedup:
Performance Relative to a
Pipelined Processor
Vector processor performance depends on
1. The amount of the program that can be
expressed in a vectorizable form.
2. Vector startup costs (length of pipeline).
3. Number of execution units and support
for chaining.
4. Number of operands that can be
simultaneously accessed/stored.
5. Number of vector registers.
Vector Processor Speedup
• The overall speedup possible for a
vector processor over a pipelined
processor is generally limited to 4.
• This assumes concurrent execution of 2
loads, 1 store and 1 arithmetic operation.
• If chaining is allowed and the memory system
can accommodate an additional concurrent
load instruction, the speedup can extend to
<6 (3 LDs, 2 arith, and 1 ST).

Vector Processor Speedup


• In practice such speedups are not achievable,
for the following reasons:
– To sustain such a high speedup the program must
consist of purely vector code.
– Due to address arithmetic and control
operations, some part of the program is not
vectorizable.
– Suppose a maximum speedup of 4 is achievable for
75% of the operations; then the overall speedup is
    Sp = T1 / (%vector · (T1/4) + %nonvector · T1)
       = 1 / (0.75/4 + 0.25) ≈ 2.3
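
This is the familiar Amdahl's-law calculation; a one-function sketch:

```python
# Speedup when only a fraction of the work is vectorizable.
def overall_speedup(vector_fraction, vector_speedup):
    return 1.0 / (vector_fraction / vector_speedup + (1.0 - vector_fraction))

print(round(overall_speedup(0.75, 4.0), 2))   # ~2.29, the 2.3 quoted above
```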
Vector Processor Speedup
– The depth of the pipeline also limits the effective
speedup.
Two parameters, R∞ and n1/2, are used to characterize
vector processor performance.
R∞ = 1/Δt is the maximum vector arithmetic execution
rate that the processor can sustain in the absence
of chaining (Δt is the cycle time of the vector pipeline).
If c functional units can be chained, the maximum
performance = c · R∞ (usually c = 2).
n1/2 is a parameter that measures the depth of the
pipeline.

Vector Processor Speedup


– Assume n is the vector size (number of elements); then
n1/2 is the length of vector operand that achieves
exactly one half of the maximum performance.
– Usually a vector processor cannot start a new
instruction until a previous instruction has finished
using the pipeline (pipeline flushing).
– This delay and startup time correspond to the depth of
the pipeline and are represented by n1/2.
Vector efficiency = R/R∞ = 1 / (1 + n1/2/n)
Vector efficiency is very low if the vector length n is
significantly less than n1/2.
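
A small sketch of this efficiency curve, using a hypothetical n1/2 of 32:

```python
# Efficiency R/R_inf: one half at n = n_half, approaching 1 for long vectors.
def efficiency(n, n_half):
    return 1.0 / (1.0 + n_half / n)

for n in (8, 32, 64, 256):
    print(n, round(efficiency(n, n_half=32), 2))   # 0.2, 0.5, 0.67, 0.89
```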
Vector Processor Speedup
– Memory contention under heavy referencing can
add to the delay.
– This contention causes vector registers to be
loaded and stored more slowly than vector
instructions are executed; the waiting time between
two vector operations adds to the delay.
– So generally, for memory-limited processors,
    n1/2 = max [vector startup cycles, vector memory
    overhead cycles]
Vector processor speedup is thus limited below the
maximum value of 4 or 5, but for code having
significant vector content a speedup of 2–3 is
generally achievable.

Multiple – Issue Machines


• These machines evaluate dependencies
among groups of instructions, and groups
found to be independent are simultaneously
dispatched to multiple execution units.
• There are two broad classes of multiple
issue machines
– Statically scheduled: the detection process is done
by the compiler.
– Dynamically scheduled: detection of
independent instructions is done by hardware in
the decoder at run time.
Statically Scheduled Machines
• Sophisticated compilers search the code at
compile time, and instructions found to be
independent in their effect are assembled
into instruction packets, which are decoded
and executed at run time.
• Statically scheduled processors must have
some additional information either implicitly
or explicitly indicating instruction packet
boundaries.

Statically Scheduled Machines


• Early statically scheduled machines include
the so-called VLIW (Very Long Instruction
Word) machines.
• These machines use an instruction word that
consists of 8 to 10 instruction fragments.
• Each fragment controls a designated
execution unit.
• To accommodate multiple instruction
fragments the instruction word is typically over
200 bits long.
Statically Scheduled Machines
• The register set is extensively multiported to
support simultaneous access by multiple
execution units.
• To avoid performance limitation by the
occurrence of branch instructions, a novel
compiler technique called trace scheduling is
used.
• In trace scheduling, branches are predicted
where possible and the predicted path is
incorporated into a large basic block.

Statically Scheduled Machines


• If an unanticipated (or mispredicted) branch
occurs during execution of the code, the proper
result is fixed up at the end of the basic block
for use by the target basic block.
Dynamically Scheduled Machines
• In dynamically scheduled machines,
detection of independent instructions is
done by hardware at run time.
• The detection may also be done at
compile time and the code suitably arranged
to optimize execution patterns.
• At run time the search for concurrent
instructions is restricted to the locality of
the last executing instruction.

Superscalar Machines
• The maximum program speedup available in
multiple issue machines largely depends on
sophisticated compiler technology.
• The potential speedup available from the Multiflow
compiler using trace scheduling is generally
less than 3.
• Recent multiple issue machines, having more
modest objectives, are called Superscalar
Machines.
• The ability to issue multiple instructions in a
single cycle is referred to as superscalar
implementation.
Major Data Paths in a Generic
M.I. Machine
(Refer to Figure 7.28 on page 458 of Flynn.)
• Simultaneous access to data as required
by a VLIW processor mandates extensive
use of register ports, which can become a
bottleneck.
• Dynamic multiple issue processors use
multiple buses connecting the register sets
and functional units, and each bus services
multiple functional units.
• This may limit the maximum degree of
concurrency.

Comparing Vector and Multiple
Issue Processors
• The goal of any processor design is to
provide cost-effective computation across a
range of applications.
• So we should compare the two technologies
on the following two factors:
– Cost
– Performance
Cost Comparison
• While comparing cost we must
approximate the area used by each
technology in the form of the additional
units required.
• The cost of execution units is about the
same for both (for the same maximum
performance).
• A major difference lies in the storage
hierarchy.
• Both rely heavily on multiported registers.

Cost Comparison
• These registers occupy a significant amount
of area. If p is the number of ports, the area
required is
    Area = (no. of regs + 3p)(bits per reg + 3p) rbe
• Most vector processors have 8 sets of
64-element registers, with each element being
64 bits in size.
• Each vector register is dual ported (a read
port and a write port). Since the registers are
sequentially accessed, each port can be
shared by all elements in the register set.
Cost Comparison
• There is an additional switching overhead
to switch each of the n vector registers to each
of the p external ports:
    Switch area = 2 · (bits per reg) · p · (no. of regs)
• So the area used by the register sets in a vector
processor (supporting 8 ports) is
    Area = 8 × [(64 + 6)(64 + 6)] = 39,200 rbe
    Switch area = 2 × (64) × 8 × (8) = 8,192 rbe
[Note: The register bit equivalent (rbe) is a useful unit defined to be a 6-transistor register (memory) cell
and represents about 2700 λ². An even larger unit, A, is defined as 1 mm² of die area at f = 1 μm. This
is also the area occupied by a 32×32-bit three-ported register file, or 1481 rbe.]

Cost Comparison
• A multiple issue processor with 32 registers,
each having 64 bits and supporting 8 ports,
will require
    Area = (32 + 3(8))(64 + 3(8)) = 4,928 rbe
• So vector processors use almost 42,464 rbe
of extra area compared to M.I. processors.
• This extra area corresponds to about 70,800
cache bits (0.6 rbe/bit), i.e. approximately 8 KB
of data cache.
• Vector processors use a small data cache.
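
The rbe arithmetic of the last three slides can be reproduced directly; a
short sketch (the 0.6 rbe per cache bit figure is the one quoted above):

```python
# Register file area model: (regs + 3p)(bits + 3p) rbe, p = number of ports.
def reg_area(num_regs, bits_per_reg, ports):
    return (num_regs + 3 * ports) * (bits_per_reg + 3 * ports)

vec_regs = 8 * reg_area(64, 64, 2)        # 8 dual-ported 64 x 64-bit registers
vec_switch = 2 * 64 * 8 * 8               # 2 x bits x ports x registers
mi_regs = reg_area(32, 64, 8)             # 32 x 64-bit registers, 8 ports
extra = vec_regs + vec_switch - mi_regs
print(vec_regs, vec_switch, mi_regs)      # 39200 8192 4928
print(extra, round(extra / 0.6))          # 42464 rbe, ~70,800 cache bits
```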
Cost Comparison
• Multiple issue machines require a larger data
cache to ensure high performance.
• Vector processors require support hardware
for managing access to the memory system.
• Also, a high degree of interleaving is required in
the memory system to support the processor
bandwidth.
• M.I. machines must support 4–6 reads and 2–3
writes per cycle. This increases the area
required by the buses between arithmetic units
and registers.

Cost Comparison
• M.I. machines must access and hold multiple
instructions each cycle from the I-cache.
• This increases the size of the I-fetch path between
the I-cache and the instruction decoder/instruction
register.
• At the instruction decoder, multiple instructions must
be decoded simultaneously and detection of
instruction independence must be performed.
• The main cost difference depends heavily on the
size of the data cache required by M.I. machines
and the cost of the memory interleaving required by
vector processors.
Performance Comparison
• The performance of vector processors
depends primarily on two factors:
– Percentage of code that is vectorizable.
– Average length of vectors.
We know that n1/2, the vector size at which the
vector processor achieves approximately half its
asymptotic performance, is roughly the same
as the depth of the arithmetic plus memory
access pipeline.
For short vectors the data cache is sufficient in M.I.
machines, so for short vectors M.I. processors would
perform better than an equivalent vector processor.

Performance Comparison
• As vectors get longer, the performance of the
M.I. machine becomes much more
dependent on the size of the data cache, while
the efficiency of the vector processor (n relative
to n1/2) improves.
• So for long vectors performance would be
better on the vector processor.
• The actual difference depends largely on the
sophistication of the compiler technology.
• The compiler can recognize the occurrence of
short vectors and treat that portion of code
as if it were scalar code.
