VI. Implicit Parallelism: Instruction-Level Parallelism. Pipeline, Superscalar & Vector Processors
Content
Pipelining
Vector Processors
Superscalar Processors
Programming issues
Pipelining
If the assembly of a car taking 100 time units can be broken into 10 pipelined stages of 10 units each, the assembly line can produce a car every 10 time units! This represents a 10-fold speedup over producing cars entirely serially, one after the other.
It is also evident from this example that to increase the speed of a single pipeline, one would break down the tasks into smaller and smaller units, thus lengthening the pipeline and increasing overlap in execution.
In the context of processors, this enables faster clock rates since the tasks are smaller.
Pipelining
The terminology is an analogy to a short length of pipe into which one starts
pushing marbles:
Vector Processor
At the i-th step the unit simultaneously performs:
stage 1 of the multiplication of elements ai and bi,
stage 2 of the multiplication of elements ai-1 and bi-1,
...
stage i of the multiplication of elements a1 and b1.
Following steps: at the m-th step (m being the number of pipeline stages) the first product a1*b1 is completed, at the (m+j)-th step the product aj+1*bj+1 is completed, and at the (n+m-1)-th (final) step the last product an*bn leaves the pipeline. In general, a pipelined unit with k stages performs n successive operations in n+k-1 steps.
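For reference, a minimal C sketch of the operation being pipelined here (function and variable names are illustrative, not from the slides): every c[i] = a[i]*b[i] enters the multiply unit one step after its predecessor, so once the pipeline is full one product is completed per step.

/* element-wise vector multiplication: the work a pipelined vector
   multiply unit performs; names are illustrative */
void vmul(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] * b[i];   /* element i enters the pipeline at step i+1 */
}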
Superscalar execution
Example: adding four numbers.
The first and second instructions are independent and are issued concurrently at t = 0:
load R1, @1000
load R2, @1008
The schedule assumes that each memory access takes a single cycle. In reality, this may not be the case.
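As a C-level analogue (a sketch, not the slide's instruction schedule), the grouping below exposes the same kind of independence: the two partial sums can be computed concurrently on a dual-issue processor, whereas a strictly left-to-right sum forms a serial dependence chain.

/* adding four numbers with two independent additions (illustrative) */
double sum4(const double x[4])
{
    double s01 = x[0] + x[1];   /* independent of s23                 */
    double s23 = x[2] + x[3];   /* may issue in the same cycle as s01 */
    return s01 + s23;           /* depends on both, so it issues last */
}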
Resource dependency: co-scheduling of two floating point operations on a dual-issue machine with a single floating point unit.
The two instructions cannot be scheduled together since both need the floating point unit.
Microprocessor capabilities
Most current microprocessors are capable of out-of-order issue and completion.
Performance may nevertheless be limited by:
limited instruction-level parallelism,
resource dependencies, or
the inability of the processor to extract parallelism.
The hardware logic for dynamic dependency analysis is typically in the range of 5-10% of the total logic on conventional microprocessors.
Its complexity grows roughly quadratically with the number of issues and can become a bottleneck.
In (c), with 2 execution units (multiply-add units), the schedule illustrates several zero-issue cycles.
Example
...
Additional parallel instructions are made available to the compiler to control parallel execution.
First used in the Multiflow Trace (1984) and subsequently, as a variant, in the Intel IA64 architecture.
Advantages:
Disadvantages:
compilers do not have the dynamic program state (e.g., the branch history buffer) available to
make scheduling decisions.
Other runtime situations such as stalls on data fetch because of cache misses are extremely
difficult to predict accurately.
Performance is sensitive to the compiler's ability to detect data and resource dependencies and read and write hazards, and to schedule instructions for maximum parallelism.
Vector and superscalar architectures are united in one group for a number of reasons:
1. Successful vector architectures are close to superscalar architectures both ideologically and in implementation: a vector pipelined unit is a specialized clone of the general-purpose superscalar pipelined unit, optimized for pipelined execution of n successive instructions performing the same operation on different operands. The optimization is that the vector pipelined unit does not need a decoding stage and uses a more effective data pipeline instead of the instruction pipeline.
2. A good program for a vector processor is one that uses a wide range of vector instructions implementing basic operations on arrays. A program intensively using basic operations on arrays is also perfectly suitable for superscalar processors, in that it allows a very high level of utilization of their pipelined units.
3. Thus, even if the superscalar architecture looks richer than the vector architecture, the real programming model used for superscalar processors should be the same as for vector processors.
b[i+m-1] = f(a[i+m-1]);
}
/* residual segment res = n mod m */
nms = n/m; res = n%m;
for(i=nms*m;i<nms*m+res;i++){
   b[i] = f(a[i]);
}
The first loop processes nms segments, each of which does m operations f(a[i]).
The last loop cleans up the remaining res elements (the residual segment).
Sometimes this residual segment is processed first, sometimes last (as shown), or, for data alignment reasons, part of the residual first and the rest last.
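For concreteness, a sketch of the complete pattern, assuming an unrolling depth of m = 4 and the same naming as the fragment above:

nms = n/4; res = n%4;            /* number of full segments and the residual */
for(i=0;i<nms*4;i+=4){           /* first loop: 4 independent templates per pass */
   b[i]   = f(a[i]);
   b[i+1] = f(a[i+1]);
   b[i+2] = f(a[i+2]);
   b[i+3] = f(a[i+3]);
}
for(i=nms*4;i<nms*4+res;i++){    /* residual segment: res = n mod 4 elements */
   b[i] = f(a[i]);
}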
We will refer to the instructions which process each f(a[i]) as a template.
Optimization: choose an appropriate depth of unrolling m which permits squeezing all the m templates together into the tightest time grouping possible, in order to hide the time it takes to get the data from memory into registers.
Such data pre-fetching was called bottom loading in former times.
b[n-1] = f(t);
where one tries to hide the next load of a[i] under the loop overhead.
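A sketch of how such a bottom-loaded loop might look (a reconstruction of the loop that ends in the statement above; the exact code is an assumption): the operand for the next iteration is fetched at the bottom of the body, so the load overlaps the loop overhead and the next evaluation of f.

t = a[0];                 /* pre-load the first operand                   */
for(i=0;i<n-1; ){
   b[i] = f(t);           /* compute with the previously fetched value    */
   t = a[++i];            /* bottom load: prefetch the next operand       */
}
b[n-1] = f(t);            /* last element uses the final prefetched value */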
Loop unrolling
Purpose: hide latencies, in particular, the delay in reading data from memory.
Unless the stored results will be needed in subsequent iterations of the loop
(a data dependency), these stores may always be hidden: their meanderings
into memory can go on at least until all the loop iterations are finished.
Since the various operations (memory fetch, f1, and f2) might run concurrently, we could set up two templates and try to align them.
Each calculation f(a[i]) and f(a[i+1]) takes some number (say m) of cycles.
By starting the steps of the f(a[i+1]) operation one cycle after those of f(a[i]), the two templates can be merged together.
If these two calculations ran sequentially, they would take twice what each one requires, that is, 2m cycles.
By merging the two together and aligning them to fill in the gaps, they can be computed in m + 1 cycles.
The separate operations can run independently and concurrently, provided it is possible to align the templates to fill in some of the gaps, and there are enough registers.
If there are only eight registers, alignment of two templates is all that seems possible at compile time.
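A source-level sketch of such a two-template alignment (f1 and f2 stand for the two concurrent stages mentioned above and are illustrative, not taken from the slides): the stage-1 computation for a[i+1] is independent of the stage-2 computation for a[i], so the hardware can overlap them, and only two temporaries (registers) are needed per pair.

static double f1(double x) { return x * x; }      /* illustrative stage 1 */
static double f2(double x) { return x + 1.0; }    /* illustrative stage 2 */

void apply_f(int n, const double *a, double *b)   /* n assumed even */
{
    for (int i = 0; i < n; i += 2) {
        double t0 = f1(a[i]);     /* stage 1 of the template for a[i]          */
        double t1 = f1(a[i+1]);   /* stage 1 for a[i+1]: independent of f2(t0) */
        b[i]   = f2(t0);          /* stage 2 of the template for a[i]          */
        b[i+1] = f2(t1);          /* stage 2 of the template for a[i+1]        */
    }
}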
Data dependency
If there is no dependency, we or the compiler can unroll the loop in any way desired.
Counterexample (a loop with a dependency):
for(i=m;i<n;i++){x[i]=f(x[i-k]);}
If n - m were divisible by 2, consider unrolling the loop above, with its dependency, into groups of 2:
for(i=m;i<n;i+=2){
   x[i]  =f(x[i-k]);
   x[i+1]=f(x[i-k+1]);
}
The two statements in each group are independent only if k ≥ 2: for k = 1, x[i-k+1] is the value x[i] just computed by the first statement, so the pair cannot be executed concurrently.
Example:
for(i=0;i<n;i++){
if(e(a[i])>0.0){
c[i]=f(a[i]);
} else {
c[i]=g(a[i]);
}}
Memory hierarchy
Vector and superscalar processors have a two-level memory hierarchy:
1. a small and fast register memory, and
2. a large and relatively slow main memory.
The full memory hierarchy of a modern computer includes:
1. Register memory
2. Cache memory
3. Main memory
4. Disk memory
The situation when a data item being referenced is not in the cache is
called a cache miss.
If the contribution of data transfer instructions to the total execution time of a program is substantial, a low number of cache misses will significantly accelerate the execution of the program.
An obvious class of programs suitable for such optimization includes programs
intensively using basic operations on arrays.
Loop tiling
Data items accessed by reference b[j] are repeatedly used by successive iterations of loop 1.
If the number of iterations of loop 2 is large enough, these data items may be flushed from the cache by the moment of their repeated use.
To minimize the flushing of repeatedly used data items, the number of iterations of loop 2 may be decreased.
To keep the total number of iterations of the loop nest unchanged, a controlling loop is introduced (sketched below).
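A sketch of the transformation on an assumed loop nest (the loop body and names are illustrative): in the original nest every iteration of loop 1 re-reads all of b[], so for large n each b[j] is evicted before it is reused; the tiled version limits loop 2 to T iterations and adds a controlling loop so that the total amount of work is unchanged.

enum { T = 256 };   /* tile size, chosen so the reused part of b[] stays in cache */

/* original nest: b[j] is reused by every iteration of loop 1 */
void scan(int m, int n, double *a, const double *b, const double *c)
{
   for(int i=0;i<m;i++)                    /* loop 1 */
      for(int j=0;j<n;j++)                 /* loop 2 */
         a[i] += b[j]*c[i];
}

/* tiled nest: same total work, but loop 2 makes at most T iterations,
   so b[k..k+T-1] is still cached when the next iteration of loop 1 reuses it */
void scan_tiled(int m, int n, double *a, const double *b, const double *c)
{
   for(int k=0;k<n;k+=T)                   /* controlling loop */
      for(int i=0;i<m;i++)                 /* loop 1 */
         for(int j=k; j<k+T && j<n; j++)   /* loop 2, tiled */
            a[i] += b[j]*c[i];
}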