
UNIT II

Topics Covered
Linear Pipeline Processors, Non-Linear Pipeline Processors, Reservation Table, Instruction Pipeline, Superpipeline and Superscalar Techniques

Linear pipeline processors

Pipelining is a technique that decomposes a sequential process into small subprocesses, each independent of the others, so that each subprocess can be executed in a dedicated segment while all segments operate concurrently. The whole task is thus partitioned into subtasks, each executed by one segment. The result produced by a segment (after it completes its computation) is transferred to the next segment in the pipeline, and the final result is obtained after the data have passed through all segments. One can picture each segment as an input register followed by a combinational circuit: the combinational circuit performs the required suboperation, the register holds the intermediate result, and the output of one segment's circuit feeds the next segment. The concept of pipelining in computer organization is analogous to an industrial assembly line with, say, manufacturing, packing and delivery divisions. While the first product is being packed, the manufacturing division starts on a new product; while the first is being delivered, a third is being manufactured and the second is being packed. Pipelining thus speeds up the overall process. Pipelining can be implemented effectively for systems having the following characteristics:

• The system repeatedly executes a basic function.


• The basic function must be divisible into independent stages that have minimal overlap with each other.

• The complexity of the stages should be roughly similar.

Pipelining in computer organization is basically a flow of information. To understand how it works in a computer system, consider a process that involves four steps (segments) and is to be repeated six times. If a single step takes t ns, then one pass through the process takes 4t ns, and repeating it six times sequentially requires 24t ns. Now consider how the same problem behaves with pipelining. This is illustrated by the space-time diagram below, which shows segment utilization as a function of time. There are six processes to be handled (shown as P1 to P6), and each process is divided into four segments (S1 to S4). For simplicity, assume each segment takes the same time, equal to one clock cycle, to complete its job. The horizontal axis shows time in clock cycles and the vertical axis lists the processes. Initially, segment 1 handles process P1. After the first clock cycle, segment 2 takes over P1 while segment 1 starts on the new process P2. The first process thus takes 4 clock cycles, and the remaining processes complete at a rate of one per clock cycle. For this example, the total time required to complete the whole job is 9 clock cycles with the pipeline organization, instead of the 24 clock cycles required by the non-pipelined configuration.

Cycle:  1    2    3    4    5    6    7    8    9
P1      S1   S2   S3   S4
P2           S1   S2   S3   S4
P3                S1   S2   S3   S4
P4                     S1   S2   S3   S4
P5                          S1   S2   S3   S4
P6                               S1   S2   S3   S4

Fig: Space-Time Diagram


Speedup ratio: The speedup ratio is the time taken by the non-pipelined process divided by the time taken using pipelining. In general, suppose there are n processes and each process is divided into k segments (subprocesses). The first process takes k clock cycles to complete, but once the pipeline is full, one result emerges every clock cycle, so the remaining n-1 processes take (n-1) clock cycles in total.

Let tp be one clock cycle time.

The time taken for n processes with k segments in the pipeline configuration is

= k*tp + (n-1)*tp = (k+n-1)*tp

If the time taken for one process is tn, the time taken to complete n processes in the non-pipelined configuration is

= n*tn

Thus the speedup ratio of the pipelined over the non-pipelined configuration is

Sk = n*tn / [(n+k-1)*tp]

If n is very large compared to k, this approaches

= tn / tp

If a process takes the same total time in both the pipelined and non-pipelined configurations, then

tn = k*tp

and the speedup ratio becomes

Sk = k*tp / tp = k

The theoretical maximum speedup ratio is therefore k, the number of segments into which the process is divided.
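These timing formulas are easy to check numerically. Below is a minimal Python sketch; the function name and parameters are illustrative, not from the text:

def pipeline_speedup(n, k, tp, tn=None):
    """Speedup of a k-segment pipeline over a non-pipelined unit.

    n  -- number of tasks
    k  -- number of pipeline segments
    tp -- pipeline clock period
    tn -- time for one task without pipelining (defaults to k*tp)
    """
    if tn is None:
        tn = k * tp                    # same total work in both configurations
    t_nonpipe = n * tn                 # tasks run strictly one after another
    t_pipe = (k + n - 1) * tp          # k cycles to fill, then one task/cycle
    return t_nonpipe / t_pipe

print(pipeline_speedup(6, 4, 1))       # 24t/9t = 2.67 for the example above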

The following limitations prevent any pipeline system from operating at its maximum theoretical rate of k (the ideal speedup ratio):

a. Different segments take different times to complete their suboperations, and the pipeline clock cycle must be chosen equal to the time delay of the segment with the maximum propagation time. All other segments therefore waste time waiting for the next clock cycle. A possible improvement is to subdivide the slowest segment into further stages (i.e., increase the number of stages); if the segment is not subdivisible, multiple copies of the resource causing the maximum delay can be provided, so that more than one instruction can execute on the different resources and overall performance improves.

b. Additional time delay may be introduced by the extra circuitry or additional software needed to overcome various hazards and to store results in the intermediate registers. Such delays are not found in a non-pipelined circuit.

c. Pipelining gives maximum benefit only if the whole process can be divided into suboperations that are independent of each other. If there is a resource conflict, a data dependency (an instruction depends on the result of a previous instruction that is not yet available, and must wait until it becomes available), or conditional or unconditional branching, then bubbles (time delays) are introduced into the pipeline.

Efficiency:
The efficiency of a linear pipeline is measured by the percentage of time the stages are busy out of the total time taken, i.e., busy time plus idle time. If n is the number of tasks, k the number of pipeline stages and t the clock period, the efficiency is given by
η = n / [k + n - 1]
The larger the number of tasks in the pipeline, the busier the pipeline stays and the better the efficiency. It can easily be seen from the expression that as n → ∞, η → 1. Equivalently,
η = Sk / k
i.e., the efficiency η of the pipeline is the speedup divided by the number of stages, or the actual speedup over the ideal speedup. In the steady state, where n >> k, η approaches 1.

Throughput:
The number of tasks completed by a pipeline per unit time is called the throughput; it represents the computing power of the pipeline. We define throughput as
W = n / [k*t + (n-1)*t] = η / t
In the ideal case, as η → 1 the throughput equals 1/t, i.e., the clock frequency. Thus maximum throughput is obtained when there is one output per clock pulse.
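The efficiency and throughput definitions translate directly into code; continuing the illustrative Python sketch from above:

def pipeline_efficiency(n, k):
    # Fraction of stage-time slots doing useful work; tends to 1 as n grows
    return n / (k + n - 1)

def pipeline_throughput(n, k, t):
    # Tasks completed per unit time; tends to 1/t (the clock rate) as n grows
    return n / ((k + n - 1) * t)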

Que. 1
A non-pipelined system takes 60 ns to process a task. The same task can be processed in a six-segment pipeline with a clock cycle of 10 ns. Determine the speedup ratio of the pipeline for 100 tasks. What is the maximum speedup that can be achieved?
Soln.
Total time taken by the non-pipelined system to complete 100 tasks
= 100 * 60 = 6000 ns
Total time taken by the pipeline configuration to complete 100 tasks
= (100 + 6 - 1) * 10 = 1050 ns
Thus the speedup ratio = 6000 / 1050 = 5.71
The maximum speedup that can be achieved = 60 / 10 = 6
Thus, if the time a task takes without pipelining equals the total time it spends passing through the pipeline (tn = k*tp), the maximum speedup ratio equals the number of segments.

Que 2.
A non-pipelined system takes 50 ns to process a task. The same task can be processed in a six-segment pipeline with a clock cycle of 10 ns. Determine the speedup ratio of the pipeline for 100 tasks. What is the maximum speedup that can be achieved?
Soln.
Total time taken by the non-pipelined system to complete 100 tasks = 100 * 50 = 5000 ns
Total time taken by the pipeline configuration to complete 100 tasks = (100 + 6 - 1) * 10 = 1050 ns
Thus the speedup ratio = 5000 / 1050 = 4.76
The maximum speedup that can be achieved = 50 / 10 = 5
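Both worked examples can be verified with the sketch above:

# Que 1: tn = 60 ns, k = 6 segments, tp = 10 ns, n = 100 tasks
print(round(pipeline_speedup(100, 6, 10, tn=60), 2))   # 5.71
# Que 2: same pipeline, tn = 50 ns
print(round(pipeline_speedup(100, 6, 10, tn=50), 2))   # 4.76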

The two areas where pipeline organization is most commonly used are arithmetic pipelines and instruction pipelines. An arithmetic pipeline handles the different stages of an arithmetic operation along the stages of a pipeline, i.e., it divides the arithmetic operation into suboperations executed by the pipeline segments. An instruction pipeline operates on a stream of instructions by overlapping the fetch, decode, and execute phases of the instruction cycle as different stages of the pipeline.

Non Linear Pipeline


In a non-linear pipeline the flow of data is not strictly linear: a dynamic pipeline allows feed-forward and feedback connections in addition to the streamline connections. A dynamic pipeline may initiate tasks from different reservation tables simultaneously, allowing multiple initiations of different functions in the same pipeline.

Difference Between Linear and Non-Linear pipeline:


1. Linear pipelines are static pipelines: they perform a fixed function. Non-linear pipelines are dynamic pipelines: they can be reconfigured to perform variable functions at different times.
2. A linear pipeline allows only streamline connections. A non-linear pipeline allows feed-forward and feedback connections in addition to the streamline connections.
3. It is relatively easy to partition a given function into a sequence of linearly ordered subfunctions. Partitioning a function for a non-linear pipeline is relatively difficult because the pipeline stages are interconnected with loops in addition to streamline connections.
4. The output of a linear pipeline is produced from the last stage. The output of a non-linear pipeline is not necessarily produced from the last stage.
5. The reservation table of a linear pipeline is trivial in the sense that data flows in a linear streamline. The reservation table of a non-linear pipeline is non-trivial because there is no linear streamline for the data flow.
6. Static pipelining is specified by a single reservation table. Dynamic pipelining is specified by more than one reservation table.
7. All initiations to a static pipeline use the same reservation table. A dynamic pipeline may allow different initiations to follow a mix of reservation tables.

Reservation Table
A reservation table shows how successive pipeline stages are utilized for a specific function evaluation; it displays the sequence in which each stage is used by the function.

Reservation Table: displays the time-space flow of data through the pipeline for one function evaluation. Rows are pipeline stages, columns are clock cycles, and an X marks a stage in use:

Time:   1   2   3   4   5   6   7   8
S1      X                   X       X
S2          X       X
S3              X       X       X

Fig: Reservation table for a function X

Latency: The number of time units (clock cycles) between two initiations of a pipeline is the
latency between them. Latency values must be non-negative integers.

Collision: When two or more initiations attempt to use the same pipeline stage at the same time, a collision occurs. A collision implies a resource conflict between two initiations in the pipeline, so it must be avoided.

Forbidden and Permissible Latencies: Latencies that cause collisions are called forbidden latencies (e.g., in the above reservation table, 2, 4, 5 and 7 are forbidden latencies).

Latencies that do not cause any collision are called permissible latencies (e.g., in the above reservation table, 1, 3 and 6 are permissible latencies).

Latency Sequence and Latency Cycle: A Latency Sequence is a sequence of permissible (non-forbidden) latencies between successive task initiations.

A Latency cycle is a latency sequence which repeats the same subsequence (cycle) indefinitely.

The Average Latency of a latency cycle is obtained by dividing the sum of all latencies by the
number of latencies along the cycle.

The latency cycle (1, 8) has an average latency of (1+8)/2=4.5.

A Constant Cycle is a latency cycle which contains only one latency value (e.g., cycles (3) and (6) are both constant cycles).
Collision Vector: The combined set of permissible and forbidden latencies can be displayed compactly by a collision vector, an m-bit binary vector C = (Cm Cm-1 ... C2 C1), where m <= n-1 for an n-column reservation table. Ci = 1 if latency i causes a collision, and Ci = 0 if latency i is permissible. (E.g., Cx = (1011010).)
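The forbidden latencies and the collision vector follow mechanically from the reservation table: a latency is forbidden exactly when it equals the distance between two X marks in the same row. A small Python sketch (the dictionary encoding of the table is an assumption, not from the text):

# Time steps at which each stage is marked for one evaluation of function X
TABLE = {"S1": [1, 6, 8], "S2": [2, 4], "S3": [3, 5, 7]}

def forbidden_latencies(table):
    # A latency is forbidden if two initiations that far apart would claim
    # the same stage at once, i.e. it is a distance between marks in a row.
    forb = set()
    for times in table.values():
        forb |= {b - a for a in times for b in times if b > a}
    return forb

def collision_vector(table, n_cols=8):
    forb = forbidden_latencies(table)
    # Bit Ci (i = n_cols-1 .. 1) is 1 if latency i causes a collision.
    return "".join("1" if i in forb else "0" for i in range(n_cols - 1, 0, -1))

print(sorted(forbidden_latencies(TABLE)))   # [2, 4, 5, 7]
print(collision_vector(TABLE))              # 1011010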

State Diagram: Specifies the permissible state transitions among successive initiations based on
the collision vector.

Simple Cycle, Greedy Cycle and MAL: A Simple Cycle is a latency cycle in which each state appears only once. In the state diagram for this example, only (3), (6), (8), (1, 8), (3, 8), and (6, 8) are simple cycles. The cycle (1, 8, 6, 8) is not simple, as it travels twice through the state (1011010).

A Greedy Cycle is a simple cycle whose edges are all made with the minimum latencies from their respective starting states. The cycles (1, 8) and (3) are greedy cycles.

MAL (Minimum Average Latency) is the minimum average latency obtained over the greedy cycles. Of the greedy cycles (1, 8) and (3), cycle (3) leads to the MAL value of 3.
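State transitions can be simulated as well. Treating the collision vector as an integer, an initiation after a permissible latency p moves the pipeline from state C to (C right-shifted by p) OR-ed with the initial collision vector; any latency greater than the largest forbidden latency leads back to the initial state. A hedged sketch of the greedy walk, again in Python:

INIT = 0b1011010   # Cx for function X; bit (i-1) set means latency i forbidden
M = 7              # largest forbidden latency

def next_state(state, lat):
    # Shift out the elapsed cycles, then OR in the collisions that the new
    # initiation itself contributes.
    return ((state >> lat) | INIT) if lat <= M else INIT

def min_permissible(state):
    for lat in range(1, M + 1):
        if not (state >> (lat - 1)) & 1:
            return lat
    return M + 1   # any latency greater than M is always permissible

def greedy_walk(start):
    # Repeatedly take the minimum permissible latency; the walk must enter a
    # cycle because the number of distinct states is finite.
    state, states, lats = start, [start], []
    while True:
        lat = min_permissible(state)
        state = next_state(state, lat)
        lats.append(lat)
        if state in states:
            cycle = lats[states.index(state):]
            return cycle, sum(cycle) / len(cycle)
        states.append(state)

print(greedy_walk(INIT))                  # ([1, 8], 4.5)
print(greedy_walk(next_state(INIT, 3)))   # ([3], 3.0)  -> MAL = 3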

Instruction pipeline
An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments.
Instruction Execution Phase
Consider an instruction pipeline of 4 stages
S1 : Fetch the instruction from the memory (FI).
S2 : Decode the instruction (DA).
S3 : Fetch the operands from the memory (FO).
S4 : Execute the instruction (EX).
Step:           1    2    3    4    5    6    7
Instruction 1   FI   DA   FO   EX
Instruction 2        FI   DA   FO   EX
Instruction 3             FI   DA   FO   EX
Instruction 4                  FI   DA   FO   EX

Fig: Space-time diagram

Que: Consider a program of 15,000 instructions executed by a linear pipeline processor with a clock rate of 25 MHz. The instruction pipeline has five stages and one instruction is issued per clock cycle. Calculate the speedup ratio, efficiency and throughput of this pipelined processor.
Soln: The clock period is 1/25 µs (40 ns).
Time taken without the pipeline = 15000 * 5 * (1/25) µs = 3000 µs
Time taken with the pipeline = (15000 + 5 - 1) * (1/25) µs
Speedup ratio = (15000 * 5) / (15000 + 5 - 1) = 4.99
Efficiency = speedup ratio / number of pipeline segments = 4.99 / 5 = 0.99
Throughput = number of tasks completed per unit time = 0.99 * 25 = 24.9 MIPS
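The same numbers fall out of the helper functions sketched earlier:

n, k, tp = 15_000, 5, 1 / 25            # tp in microseconds (25 MHz clock)
print(pipeline_speedup(n, k, tp))        # ~4.999
print(pipeline_efficiency(n, k))         # ~0.9997
print(pipeline_throughput(n, k, tp))     # ~24.99 instructions/us = ~25 MIPS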
Principles of designing pipeline processors
Buffers are used to close up the speed gap between memory accesses for instructions or operands and the processing stages. Buffering can avoid unnecessary idling of the processing stages caused by memory access conflicts or by unexpected branching or interrupts. The concept of busing eliminates the time delay in storing and retrieving intermediate results to and from the registers.
Computer performance can be greatly enhanced if one can eliminate unnecessary memory accesses and combine transitive or multiple fetch-store operations into faster register operations. This is carried out by register tagging and forwarding.
Another method of smoothing the traffic flow in a pipeline is to use buffers to close up the speed gap between the memory accesses for instructions or operands and the arithmetic and logic execution in the functional pipes. The instruction and operand buffers provide a continuous supply of instructions and operands to the appropriate pipeline units. Buffering can avoid unnecessary idling of the processing stages caused by memory access conflicts or by unexpected branching or interrupts. If the buffer is sufficiently large, an entire loop can sometimes be held in the buffer, avoiding repeated fetches of the same instructions; buffering of this kind is used heavily in pipelined computers.
Three buffer types are used for the various instruction and data types. Instructions are fetched into the instruction fetch buffer before being sent to the instruction unit. After decoding, fixed-point and floating-point instructions and data are sent to their dedicated buffers. The store address and data buffers are used for continuously storing results back to memory. The storage conflict buffer is used only when there is a memory access conflict.
Busing Buffers
The subfunction being executed by one stage should be independent of the subfunctions being executed by the remaining stages; otherwise some process in the pipeline must be halted until the dependency is removed. For example, when an instruction waiting to be executed needs a result that is yet to be produced by another instruction, its execution must be suspended until the dependency is released.
Another example is the conflicting use of some registers or memory locations by different segments of a pipeline. These problems cause additional time delays, so an efficient internal busing structure is desired to route results to the requesting stations with minimum time delay.
In the AP-120B and FPS-164 attached processors the busing structure is even more sophisticated. Seven data buses provide multiple data paths. The output of the floating-point adder in the AP-120B can be directly routed back to the input of the floating-point adder, to the input of the floating-point multiplier, to the data pad, or to the data memory. Similar busing is provided for the output of the floating-point multiplier. This eliminates the time delay in storing and retrieving intermediate results to and from the registers.
Internal Forwarding and Register Tagging
Two techniques enhance the performance of computers with multiple execution pipelines:
1. Internal Forwarding refers to a short-circuit technique that replaces unnecessary memory accesses with register-to-register transfers in a sequence of fetch-arithmetic-store operations.
2. Register Tagging refers to the use of tagged registers, buffers and reservation stations for exploiting concurrent activities among multiple arithmetic units.
Computer performance can be greatly enhanced by eliminating unnecessary memory accesses and combining transitive or multiple fetch-store operations into faster register operations. This concept of internal data forwarding can be explored in three directions. We use the symbols Mi and Rj to represent the ith word in memory and the jth register, and (Mi) and (Rj) to represent their contents.
Store-Fetch Forwarding
A store followed by a fetch of the same word can be replaced by the store and a register transfer, eliminating one memory access.
Two memory accesses:
Mi ← (R1) (store)
R2 ← (Mi) (fetch)
are replaced by one memory access:
Mi ← (R1) (store)
R2 ← (R1) (register transfer)
Fetch-Fetch Forwarding
Two fetches of the same word can be replaced by one fetch and one register transfer, eliminating one memory access.
Two memory accesses:
R1 ← (Mi) (fetch)
R2 ← (Mi) (fetch)
are replaced by one memory access:
R1 ← (Mi) (fetch)
R2 ← (R1) (register transfer)
Store-Store Overwriting
Two successive memory updates of the same word can be combined into one, since the second store overwrites the first.
Two memory accesses:
Mi ← (R1) (store)
Mi ← (R2) (store)
are replaced by one memory access:
Mi ← (R2) (store)
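These three rewrites act like peephole rules over a stream of memory operations. A toy Python sketch (the operation encoding is hypothetical, purely for illustration):

# Each operation is (kind, dst, src): ("store", "Mi", "R1") means Mi <- (R1),
# ("fetch", "R2", "Mi") means R2 <- (Mi).
def forward(ops):
    out = []
    for kind, dst, src in ops:
        op = (kind, dst, src)
        if out:
            pkind, pdst, psrc = out[-1]
            if pkind == "store" and kind == "fetch" and src == pdst:
                op = ("move", dst, psrc)        # store-fetch forwarding
            elif pkind == "fetch" and kind == "fetch" and src == psrc:
                op = ("move", dst, pdst)        # fetch-fetch forwarding
            elif pkind == "store" and kind == "store" and dst == pdst:
                out.pop()                       # store-store overwriting
        out.append(op)
    return out

print(forward([("store", "Mi", "R1"), ("fetch", "R2", "Mi")]))
# [('store', 'Mi', 'R1'), ('move', 'R2', 'R1')] -- one memory access saved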

Forwarding and Data Hazards


Sometimes it is possible to avoid data hazards by noting that a value that results from one
instruction is not needed until a late stage in a following instruction, and sending the data
directly from the output of the first functional unit back to the input of the second one (which is
sometimes the same unit). In the general case, this would require the output of every functional
unit to be connected through switching logic to the input of every functional unit.
Data hazards can take three forms:
Read after write (RAW): Attempting to read a value that hasn't been written yet. This is the
most common type, and can be overcome by forwarding.
Write after write (WAW): Writing a value before a preceding write has completed. This can only
happen in complex pipes that allow instructions to proceed out of order, or that have multiple
write-back stages (mostly CISC), or when we have multiple pipes that can write (superscalar).
Write after read (WAR): Writing a value before a preceding read has completed. These also
require a complex pipeline that can sometimes write in an early stage, and read in a later stage.
It is also possible when multiple pipelines (superscalar) or out-of-order issue are employed.
The fourth situation, read after read (RAR) does not produce a hazard.
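Detecting the three hazard types between two instructions reduces to set intersections on the registers each one reads and writes, as the following illustrative Python sketch shows (the encoding is assumed):

def classify_hazards(first, second):
    # Each instruction is (reads, writes); `second` is later in program order.
    r1, w1 = first
    r2, w2 = second
    hazards = []
    if w1 & r2: hazards.append("RAW")   # read after write
    if w1 & w2: hazards.append("WAW")   # write after write
    if r1 & w2: hazards.append("WAR")   # write after read
    return hazards                      # RAR (r1 & r2 alone) is not a hazard

# DIVD F0,F2,F4 followed by ADDD F10,F0,F8: ADDD reads F0 before it is written
print(classify_hazards(({"F2", "F4"}, {"F0"}), ({"F0", "F8"}, {"F10"})))  # ['RAW']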
Forwarding does not solve every RAW hazard situation. For example, if a functional unit is
merely slow and fails to produce a result that can be forwarded in time, then the pipeline must
stall. A simple example is the case of a load, which has a high latency. This is the sort of
situation where compiler scheduling of instructions can help, by rearranging independent
instructions to fill the delay slots. The processor can also rearrange the instructions at run time,
if it has access to a window of prefetched instructions (called a prefetch buffer). It must perform
much the same analysis as the compiler to determine which instructions are dependent on each
other, but because the window is usually small, the analysis is more limited in scope. The small
size of the window is due to the cost of providing a wide enough datapath to predecode multiple
instructions at once, and the complexity of the dependence testing logic.
Branch Penalty Hiding
The control hazards due to branches can cause a large part of the pipeline to be flushed, greatly
reducing its performance. One way of hiding the branch penalty is to fill the pipe behind the
branch with instructions that would be executed whether or not the branch is taken. If we can
find the right number of instructions that precede the branch and are independent of the test,
then the compiler can move them immediately following the branch and tag them as branch
delay filling instructions. The processor can then execute the branch,
and when it determines the appropriate target, the instruction is fetched into the pipeline with
no penalty.
Of course, this scheme depends on how the pipeline is designed. It effectively binds part of the
pipeline design for a specific implementation to the instruction set architecture. As we've seen
before, it is generally a bad idea to bind implementation details to the ISA because we may need
to change them later on. For example, if we decide to lengthen the pipeline so that the number
of delay slots increases, we have to recompile our code to have it execute efficiently -- we no
longer have strict binary compatibility among models of the same "ISA". The filling of branch
delays can be done dynamically in hardware by reordering instructions out of the prefetch
buffer. But this leads to other problems. Another way to hide branch penalties is to avoid certain
kinds of branches. For example, if we have
IF A < 0
THEN A = -A
we would normally implement this with a nearby branch. However, we could instead use an
instruction that performs the arithmetic conditionally (skips the write back if the condition
fails). The advantage of this scheme is that, although one pipeline cycle is wasted, we do not
have to flush the rest of the pipe (also, for a dynamic branch prediction scheme, we need not put
an extra branch into the prediction unit). These are called predicated instructions, and the
concept can be extended to other sorts of operations, such as conditional loading of a value from
memory.
Branch Prediction
Branches are the bane of any pipeline, causing a potentially large decrease in performance as we
saw earlier. There are several ways to reduce this loss by predicting the action of the branch
ahead of time.
Simple static prediction assumes either that all branches are taken or that none are. The designer decides which way to predict from instruction trace statistics. Once the choice is made, the compiler can help by properly ordering local jumps. A slightly more complex static branch prediction
heuristic is that backward branches are usually taken and forward branches are not (backwards
taken, forwards not or BTFN). This assumes that most backward branches are loop returns and
that most forward branches are the less likely cases of a conditional branch. Compiler static
prediction involves the use of special branches that indicate the most likely choice (taken or not,
or more typically taken or other, since the most predictable branches are
those at the ends of loops that are mostly taken). If the prediction fails in this case, then the
usual cancellation of the instructions in the delay slots occurs and a branch penalty results.
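The BTFN heuristic itself is a one-line predicate; a sketch (assuming numeric program-counter values):

def btfn_predict_taken(branch_pc, target_pc):
    # Backwards taken, forwards not: a backward target is assumed to close a
    # loop (taken); a forward target is assumed not taken.
    return target_pc < branch_pc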
Dynamic instruction scheduling
As discussed above, static instruction scheduling can be optimized by the compiler; dynamic scheduling is achieved either with a scoreboard or with Tomasulo's register-tagging algorithm, both discussed under superscalar processors below.

Superpipeline and Superscalar Techniques

Instruction level parallelism is obtained primarily in two ways in uniprocessors: through


pipelining and through keeping multiple functional units busy executing multiple instructions at
the same time. When a pipeline is extended in length beyond the normal five or six stages (e.g.,
I-Fetch, Decode/Dispatch, Execute, D-fetch, Writeback), then it may be called Superpipelined.
If a processor executes more than one instruction at a time, it may be called Superscalar. A
superscalar architecture is one in which several instructions can be initiated simultaneously and
executed independently. These two techniques can be combined into a superscalar pipeline architecture.
Superpipeline

Superpipelining is based on dividing the stages of a pipeline into substages and thus increasing
the number of instructions which are supported by the pipeline at a given moment. For example
if we divide each stage into two, the clock cycle period t will be reduced to the half, t/2; hence,
at the maximum capacity, the pipeline produces a result every t/2 s. For a given architecture and
the corresponding instruction set there is an optimal number of pipeline stages; increasing the
number of stages over this limit reduces the overall performance. A solution to further improve
speed is the superscalar architecture. Given a pipeline stage time T, it may be possible to
execute at a higher rate by starting operations at intervals of T/n. This can be accomplished in
two ways:
• Further divide each of the pipeline stages into n substages.
• Provide n pipelines that are overlapped.
The first approach requires faster logic and the ability to subdivide the stages into segments with uniform latency. It may also require more complex inter-stage interlocking and stall-restart logic.
The second approach could be viewed in a sense as staggered superscalar operation, and has
associated with it all of the same requirements except that instructions and data can be fetched
with a slight offset in time. In addition, inter-pipeline interlocking is more difficult to manage
because of the sub-clock period differences in timing between the pipelines.
Even so, staggered clock pipelines may be necessary with superscalar designs in the future, in
order to reduce peak power and corresponding power-supply induced noise. Alternatively,
designs may be forced to shift to a balanced mode of circuit operation in which logic transitions
are balanced by reverse transitions -- a technique used in the Cray supercomputers that resulted
in the computer presenting a pure DC load to the power supply, and greatly reduced noise in the
system.
Inevitably, superpipelining is limited by the speed of logic, and the frequency of unpredictable
branches. Stage time cannot productively grow shorter than the interstage latch time, and so this
is a limit for the number of stages.
The MIPS R4000 is sometimes called a superpipelined machine, although its 8 stages really
only split the I-fetch and D-fetch stages of the pipe and add a Tag Check stage. Nonetheless, the
extra stages enable it to operate with higher throughput. The UltraSPARC's 9-stage pipe
definitely qualifies it as a superpipelined machine, and in fact it is a Super-Super design because
of its superscalar issue. The Pentium 4 splits the pipeline into 20 stages to enable increased
clock rate. The benefit of such extensive
pipelining is really only gained for very regular applications such as graphics. On more irregular
applications, there is little performance advantage.

Superscalar

Superscalar processing has its origins in the Cray-designed CDC supercomputers, in which
multiple functional units are kept busy by multiple instructions. The CDC machines could pack
as many as 4 instructions in a word at once, and these were fetched together and
dispatched via a pipeline. Given the technology of the time, this configuration was fast enough
to keep the functional units busy without outpacing the instruction memory.
Current technology has enabled, and at the same time created the need to issue instructions in
parallel. As execution pipelines have approached the limits of speed, parallel execution has been
required to improve performance. As this requires greater fetch rates from memory, which hasn't
accelerated comparably, it has become necessary to fetch instructions in parallel
-- fetching serially and pipelining their dispatch can no longer keep multiple functional units
busy. At the same time, the movement of the L1 instruction cache onto the chip has permitted
designers to fetch a cache line in parallel with little cost.
In some cases superscalar machines still employ a single fetch-decode-dispatch pipe that drives
all of the units. For example, the UltraSPARC splits execution after the third stage of a unified
pipeline. However, it is becoming more common to have multiple fetch-decode- dispatch pipes
feeding the functional units.
The choice of approach depends on tradeoffs of the average execute time vs. the speed with
which instructions can be issued. For example, if execution averages several cycles, and the
number of functional units is small, then a single pipe may be able to keep the units utilized.
When the number of functional units grows large and/or their execution time approaches the
issue time, then multiple issue pipes may be necessary.
Having multiple issue pipes requires

• being able to fetch instructions for that many pipes at once


• inter-pipeline interlocking
• reordering of instructions for multiple interlocked pipelines
• multiple write-back stages
• multiport D-cache and/or register file, and/or functionally split register file

Reordering may be either static (compiler) or dynamic (using hardware lookahead). It can be
difficult to combine the two approaches because the compiler may not be able to predict the
actions of the hardware reordering mechanism.
Superscalar operation is limited by the number of independent operations that can be extracted
from an instruction stream. It has been shown in early studies on simpler processor models, that
this is limited, mostly by branches, to a small number (<10, typically about 4).
More recent work has shown that, with speculative execution and aggressive branch prediction,
higher levels may be achievable. On certain highly regular codes, the level of parallelism may
be quite high (around 50). Of course, such highly regular codes are just as amenable to other
forms of parallel processing that can be employed more directly, and are also the exception
rather than the rule. Current thinking is that about 6-way instruction level parallelism for a
typical program mix may be the natural limit, with 4-way being likely for integer codes.
Potential ILP may be three times this, but it will be very difficult to exploit even a majority of
this parallelism. Nonetheless, obtaining a factor of 4 to 6 boost in performance is quite
significant, especially as processor speeds approach their limits.
Going beyond a single instruction stream and allowing multiple tasks (or threads) to operate at
the same time can enable greater system throughput. Because these are naturally independent at
the fine-grained level, we can select instructions from different streams to fill pipeline slots that
would otherwise go vacant in the case of issuing from a single thread. In turn, this makes it
useful to add more functional units. We shall further explore these multithreaded architectures
later in the course.

Hardware Support for Superscalar Operation


There are two basic hardware techniques that are used to manage the simultaneous execution of
multiple instructions on multiple functional units: Scoreboarding and reservation stations.
Scoreboarding originated in the Cray-designed CDC-6600 in 1964, and reservation stations first
appeared in the IBM 360/91 in 1967, as designed by Tomasulo.

Scoreboard
A scoreboard is a centralized table that keeps track of the instructions to be performed and the
available resources and issues the instructions to the functional units when everything is ready
for them to proceed. As the instructions execute, dependences are checked and execution is
stalled as necessary to ensure that in-order semantics are preserved. Out of order execution is
possible, but is limited by the size of the scoreboard and the execution rules. The scoreboard can
be thought of as preceding dispatch, but it also controls execution after the issue. In a
scoreboarded system, the results can be forwarded directly to their destination register (as long
as there are no write after read hazards, in which case their execution is stalled), rather than
having to proceed to a final write-back stage.
In the CDC scoreboard, each register has a matching Result Register Designator that indicates
which functional unit will write a result into it. The fact that only one functional unit can be
designated for writing to a register at a time ensures that WAW dependences cannot occur. Each
functional unit also has a corresponding set of Entry-Operand Register Designators that indicate
what register will hold each operand, whether the value is valid (or pending) and if it is pending,
what functional unit will produce it (to facilitate forwarding). None of the operands is released
to a functional unit until they are all valid, precluding RAW dependences. In addition , the
scoreboard stalls any functional unit whose result would write a register that is still listed as an
Entry-Operand to a functional unit that is waiting for an operand or is busy, thus avoiding WAR
violations. An instruction is only allowed to issue if its specified functional unit is free and its result register is not reserved by another functional unit that has not yet completed.

Four Stages of Scoreboard Control
1. Issue—decode instructions & check for structural hazards (ID1) If a functional unit
for the instruction is free and no other active instruction has the same destination register
(WAW), the scoreboard issues the instruction to the functional unit and updates its internal data
structure. If a structural or WAW hazard exists, then the instruction issue stalls, and no further
instructions will issue until these hazards are cleared.
2. Read operands—wait until no data hazards, then read operands (ID2) A source operand is available if no earlier-issued active instruction is going to write it; if the register containing the operand is still to be written by a currently active functional unit, the operand is pending. When the source
operands are available, the scoreboard tells the functional unit to proceed to read the operands
from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in
this step, and instructions may be sent into execution out of order.
3. Execution—operate on operands (EX) The functional unit begins execution upon
receiving operands. When the result is ready, it notifies the scoreboard that it has completed
execution.
4. Write result—finish execution (WB) Once the scoreboard is aware that the functional
unit has completed execution, the scoreboard checks for WAR hazards.
If none, it writes the result. If a WAR hazard exists, it stalls the instruction.

Example:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
The CDC 6600 scoreboard would stall SUBD until ADDD has read its operands (a WAR hazard on F8).
Three Parts of the Scoreboard
1. Instruction status—which of 4 steps the instruction is in
2. Functional unit status—Indicates the state of the functional unit (FU). 9 fields for each
functional unit
Busy—Indicates whether the unit is busy or not
Op—Operation to perform in the unit (e.g., + or –)
Fi—Destination register
Fj, Fk—Source-register numbers
Qj, Qk—Functional units producing source registers Fj, Fk
Rj, Rk—Flags indicating when Fj, Fk are ready and not yet read. Set to
No after operands are read.
3. Register result status—Indicates which functional unit will write each register, if one
exists. Blank when no pending instructions will write that register
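The bookkeeping above maps naturally onto simple data structures. A hedged Python sketch of the functional-unit status and register-result status tables, with an issue check covering the structural and WAW tests from stage 1 (the field names follow the list above; the code itself is an assumption, not the CDC design):

from dataclasses import dataclass

@dataclass
class FUStatus:
    # One row of the functional-unit status table (the 9 fields listed above)
    busy: bool = False   # whether the unit is busy
    op: str = ""         # operation to perform (e.g. "+" or "-")
    fi: str = ""         # destination register
    fj: str = ""         # source register 1
    fk: str = ""         # source register 2
    qj: str = ""         # functional unit producing Fj, if pending
    qk: str = ""         # functional unit producing Fk, if pending
    rj: bool = False     # Fj ready and not yet read
    rk: bool = False     # Fk ready and not yet read

# Register result status: which unit, if any, will write each register
register_result = {"F0": "DIV"}   # e.g. while DIVD F0,F2,F4 is in flight

def can_issue(unit, dest):
    # Stage 1 checks: structural hazard (unit busy) and WAW hazard (another
    # active instruction already has the same destination register).
    return not unit.busy and dest not in register_result

print(can_issue(FUStatus(), "F10"))   # True
print(can_issue(FUStatus(), "F0"))    # False: WAW on F0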
Scoreboard Implications
• Provides a solution for WAR and WAW hazards.
• Solution for WAR: stall the write in WB to allow the reads to take place; read registers only during the Read Operands stage.
• For WAW, the hazard must be detected: stall in the Issue stage until the other instruction completes.
• Multiple instructions must be in the execution phase at once.
• The scoreboard keeps track of dependencies and the state of operations:
– It monitors every change in the hardware.
– It determines when to read operands, when an instruction can execute, and when it can write back.
– Hazard detection and resolution are centralized.
Reservation Stations
The reservation station approach releases instructions directly to a pool of buffers associated with their intended functional units (if more than one unit of a particular type is present, then the units may share a single station). The reservation stations are a distributed resource, rather than being centralized, and can be thought of as following dispatch. A
reservation is a record consisting of an instruction and its requirements to execute
-- its operands as specified by their sources and destination and bits indicating when valid
values are available for the sources. The instruction is released to the functional unit when its
requirements are satisfied, but it is important to note that satisfaction doesn't require an
operand to actually be in a register -- it can be forwarded to the reservation station for
immediate release or to be buffered (see below) for later release. Thus, the reservation station's
influence on execution can be thought of as more implicit and data dependent than the explicit
control exercised by the scoreboard.
Tomasulo Algorithm
This is the hardware dependence-resolution technique used for the IBM 360/91, about three years after the CDC 6600.

Three Stages of the Tomasulo Algorithm
1. Issue—get an instruction from the FP operation queue.
If a reservation station is free, issue the instruction and send the operands (renaming the registers).
2. Execution—operate on operands (EX).
When both operands are ready, execute; if not ready, watch the Common Data Bus (CDB) for the result.
3. Write result—finish execution (WB).
Write the result on the Common Data Bus to all awaiting units; mark the reservation station available.
Here the storage of operands resulting from instructions that completed out of order is done
through renaming of the registers. There are two mechanisms commonly used for renaming.
One is to assign physical registers from a free pool to the logical registers as they are identified
in an instruction stream. A lookup table is then used to map the logical register references to
their physical assignments. Usually the pool is larger than the logical register set to allow for
temporary buffering of results that are computed but not yet ready to write back. Thus, the
processor must keep track of a larger set of register names than the instruction set architecture
specifies. When the pool is empty, instruction issue stalls.
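A minimal sketch of the first mechanism, renaming destinations from a free pool (hypothetical names, not the 360/91's actual logic):

free_pool = [f"p{i}" for i in range(16)]   # physical registers
map_table = {}                             # logical register -> physical

def rename(dst, srcs):
    # Sources read the current mapping (unmapped names read the architected
    # register); the destination gets a fresh physical register, so later
    # writers of the same logical register cannot collide (no WAW/WAR).
    if not free_pool:
        raise RuntimeError("rename stall: free pool empty")
    phys_srcs = [map_table.get(s, s) for s in srcs]
    map_table[dst] = free_pool.pop(0)
    return map_table[dst], phys_srcs

print(rename("F0", ["F2", "F4"]))    # ('p0', ['F2', 'F4'])
print(rename("F10", ["F0", "F8"]))   # ('p1', ['p0', 'F8']) -- F0 reads p0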
The other mechanism is to keep the traditional association of logical and physical registers, but
then provide additional buffers either associated with the reservation stations or kept in a central
location. In either case, each of these "reorder buffers" is associated with a given instruction,
and its contents (once computed) can be used in forwarding operations as long as the instruction
has not completed.
When an instruction reaches the point that it may complete in a manner that preserves sequential
semantics, then its reservation station is freed and its result appears in the logical register that
was originally specified. This is done either by renaming the temporary register to be one of the
logical registers, or by transferring the contents of the reorder buffer to the appropriate physical
register.

Out-of-Order Issue
To enable out-of-order dispatch of instructions to the pipelines, we must provide at least two
reservation stations per pipe that are available for issue at once. An alternative would
be to rearrange instructions in the prefetch buffer, but without knowing the status of the pipes, it
would be difficult to make such a reordering effective. By providing multiple reservation
stations, however, we can continue issuing instructions to pipes, even though an instruction may
be stalled while awaiting resources. Then, whatever instruction is ready first can enter the pipe
and execute. At the far end of the pipeline, the out-of-order instruction must wait to be retired in
the proper order. This necessitates a mechanism for keeping track of the proper order of
instructions (note that dependences alone cannot guarantee that instructions will be properly
reordered when they complete).

Superscalar-Superpipeline
Of course we may also combine superscalar operation with superpipelining. The result is
potentially the product of the speedup factors.
However, it is even more difficult to interlock between parallel pipes that are divided into many
stages. Also, the memory subsystem must be able to sustain a level of instruction throughput
corresponding to the total throughput of the multiple pipelines -- stretching the
processor/memory performance gap even more. Of course, with so many pipes and so many
stages, branch penalties become huge, and branch prediction becomes a serious bottleneck
(offsetting this somewhat is the potential for speculative branch execution that arises with a
large number of pipes).
But the real problem may be in finding the parallelism required to keep all of the pipes and
stages busy between branches. Consider that a machine with 12 pipelines of 20 stages must
always have access to a window of 240 instructions that are scheduled so as to avoid all
hazards, and that the average of 40 branches that would be present in a block of that size are all
correctly predicted sufficiently in advance to avoid stalling in the prefetch unit. This is a tall
order, even for highly regular code with a good balance of floating point to integer operations,
let alone irregular code that is typically unbalanced in its operation mix. As usual, scientific
code and image processing with their regular array operations often provide the best
performance for a super-super processor. However, a vector unit could more directly address the
array processing problems, and what is left to the scalar unit may not benefit nearly so
greatly from all of this complexity. Scalar processing can certainly benefit from ILP in many
cases, but probably in patterns that differ from vector processing.

References/Suggested readings

Advanced Computer Architecture: Kai Hwang
