Unit II
Topics Covered
Linear pipeline processors, non-linear pipeline processors, reservation tables,
instruction pipelines, superpipeline and superscalar techniques
Linear pipelining
Pipelining is a technique that decomposes a sequential process into small subprocesses, which are independent of each other, so that each subprocess can be executed in a dedicated segment and all the segments operate concurrently. The whole task is thus partitioned into subtasks, each executed by its own segment. The result produced by a segment (after it performs all its computation) is transferred to the next segment in the pipeline, and the final result is obtained after the data have passed through all segments. Each segment can be pictured as an input register followed by a combinational circuit: the combinational circuit performs the required suboperation, the register holds the intermediate result, and the output of one combinational circuit is given as input to the next segment. The concept of pipelining in computer organization is analogous to an industrial assembly line with, say, manufacturing, packing, and delivery divisions: while one product is being packed by the packing division, a new product is being manufactured by the manufacturing division; while the first product is being delivered by the delivery unit, a third product is being manufactured and the second has been packed. Pipelining thus speeds up the overall process. It is most effective when the process can be divided into suboperations of roughly equal duration that do not depend on one another. The space-time diagram below shows six processes P1-P6 flowing through a four-segment pipeline S1-S4 over nine clock cycles:
Clock cycle   1    2    3    4    5    6    7    8    9
P1            S1   S2   S3   S4
P2                 S1   S2   S3   S4
P3                      S1   S2   S3   S4
P4                           S1   S2   S3   S4
P5                                S1   S2   S3   S4
P6                                     S1   S2   S3   S4
With k segments and a clock period tp, the first process takes k*tp to emerge from the pipeline, and each of the remaining n-1 processes completes one clock later. The time taken for n processes having k segments in the pipeline configuration is therefore
= (k + n - 1) * tp
The time taken for one process without pipelining is tn, so the time taken to complete n processes in the non-pipeline configuration is
= n * tn
Thus the speedup ratio of the pipeline over the non-pipeline configuration is
Sk = n*tn / [(k + n - 1)*tp]
As n becomes very large (n >> k), the (k - 1) term becomes negligible and the speedup approaches
Sk = tn / tp
If a process takes the same total time in both the pipeline and non-pipeline configurations, then
tn = k*tp
Sk = k*tp / tp = k
Theoretically, the maximum speedup ratio is k, where k is the total number of segments into which the process is divided.
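These relations are easy to check numerically. The following is a minimal Python sketch (the function names are illustrative, not from the text) that computes both execution times and the resulting speedup:

    def nonpipeline_time(n, tn):
        """Time for n tasks of tn each, executed strictly sequentially."""
        return n * tn

    def pipeline_time(n, k, tp):
        """Time for n tasks in a k-segment pipeline with clock period tp:
        the first task needs k cycles, each later task finishes one cycle apart."""
        return (k + n - 1) * tp

    def speedup(n, k, tn, tp):
        return nonpipeline_time(n, tn) / pipeline_time(n, k, tp)

    # With tn = k*tp, the speedup tends to k as n grows:
    k, tp = 6, 10            # 6 segments, 10 ns clock period
    tn = k * tp              # 60 ns for the whole task without pipelining
    for n in (1, 10, 100, 10000):
        print(n, round(speedup(n, k, tn, tp), 3))    # approaches k = 6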
The following limitations prevent any pipeline system from operating at its maximum theoretical rate, i.e., at a speedup ratio of k:
a. Different segments take different times to complete their suboperations, and the pipeline clock cycle must be chosen equal to the time delay of the segment with the maximum propagation time. All the other segments therefore waste time waiting for the next clock cycle. A possible improvement is to subdivide the slowest segment into further stages, i.e., increase the number of stages; if the segment is not subdivisible, multiple copies of the resource causing the maximum delay can be provided so that more than one instruction can execute on the different resources and overall performance improves.
b. Pipelining is of maximum benefit only if the whole process can be divided into suboperations that are independent of each other. If there is a resource conflict or a data dependency, i.e., an instruction depends on the result of a previous instruction that is not yet available, the instruction has to wait until the result becomes available; similarly, conditional and unconditional branching introduce bubbles (time delays) into the pipeline.
Efficiency:
The efficiency of a linear pipeline is measured by the percentage of time the pipeline stages are busy over the total time taken, i.e., the sum of busy time and idle time. If n is the number of tasks and k is the number of pipeline stages, the efficiency is given by
η = n / [k + n - 1]
The larger the number of tasks in the pipeline, the busier the pipeline remains and the better the efficiency; as n → ∞, η → 1. Equivalently,
η = Sk / k
i.e., the efficiency η of the pipeline is the speedup divided by the number of stages, or the actual speedup ratio over the ideal speedup ratio. In the steady state, where n >> k, η approaches 1.
Throughput:
The number of tasks completed by a pipeline per unit time is called the throughput; it represents the computing power of the pipeline. With clock period t, we define the throughput as
W = n / [k*t + (n-1)*t] = η / t
In the ideal case, as η → 1, the throughput approaches 1/t, which equals the clock frequency. Thus the maximum throughput is obtained when there is one output per clock pulse.
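A short Python sketch (again with names of our own choosing) that evaluates both measures:

    def efficiency(n, k):
        """Fraction of stage-time slots doing useful work: n / (k + n - 1)."""
        return n / (k + n - 1)

    def throughput(n, k, t):
        """Tasks completed per unit time: n / ((k + n - 1) * t) = efficiency / t."""
        return efficiency(n, k) / t

    k, t = 6, 10e-9          # 6 stages, 10 ns clock period
    for n in (10, 100, 10000):
        print(n, round(efficiency(n, k), 4), throughput(n, k, t))
    # As n grows, efficiency approaches 1 and throughput approaches 1/t = 1e8 tasks/s.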
Que.1.
A non-pipeline system takes 60 ns to process a task. The same task can be processed in six
segment pipeline with a clock cycle of 10 ns. Determine the speedup ratio of the pipeline for
100 tasks. What is the maximum speed up that can be achieved?
Soln.
Total time taken by the non-pipeline configuration to complete 100 tasks is
= 100 * 60 = 6000 ns
Total time taken by the pipeline configuration to complete 100 tasks is = (100 + 6 - 1) * 10 = 1050 ns
Thus the speedup ratio is = 6000 / 1050 = 5.71
The maximum speedup that can be achieved for this process is = 60 / 10 = 6
Thus, if the time a process takes without pipelining equals the total time it takes to pass through the pipeline (tn = k*tp, here 60 ns = 6 * 10 ns), the maximum speedup ratio equals the number of segments.
Que 2.
A non-pipeline system takes 50 ns to process a task. The same task can be processed in a six
segment pipeline with a clock cycle of 10 ns. Determine the speedup ratio of the pipeline for
100 tasks. What is the maximum speed up that can be achieved?
Soln.
Total time taken by the non-pipeline configuration to complete 100 tasks is = 100 * 50 = 5000 ns
Total time taken by the pipeline configuration to complete 100 tasks is = (100 + 6 - 1) * 10 = 1050 ns
Thus the speedup ratio is = 5000 / 1050 = 4.76
The maximum speedup that can be achieved for this process is = 50 / 10 = 5
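Both solutions can be checked with the speedup function sketched earlier, repeated here as a self-contained snippet:

    def speedup(n, k, tn, tp):
        return (n * tn) / ((k + n - 1) * tp)

    print(round(speedup(100, 6, 60, 10), 2))   # Que.1: 5.71 (maximum 60/10 = 6)
    print(round(speedup(100, 6, 50, 10), 2))   # Que.2: 4.76 (maximum 50/10 = 5)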
The two areas where pipeline organization is most commonly used are arithmetic pipelines and instruction pipelines. An arithmetic pipeline handles the different stages of an arithmetic operation along the stages of the pipeline, i.e., it divides the arithmetic operation into suboperations for execution in the pipeline segments. An instruction pipeline operates on a stream of instructions by overlapping the fetch, decode, and execute phases of the instruction cycle as the different stages of the pipeline.
Reservation Table
Reservation tables describe how successive pipeline stages are utilized for a specific evaluation function; they show the sequence in which the function uses each stage. A reservation table displays the time-space flow of data through the pipeline for one function evaluation. The table below (three stages S1-S3, eight clock cycles) is the example referred to throughout this section:

Stage/Time   1    2    3    4    5    6    7    8
S1           X                        X         X
S2                X         X
S3                     X         X         X
Latency: The number of time units (clock cycles) between two initiations of a pipeline is the latency between them. Latency values must be non-negative integers.
Collision: When two or more initiations attempt to use the same pipeline stage at the same time, a collision occurs. A collision implies a resource conflict between two initiations in the pipeline, so it must be avoided.
Forbidden and Permissible Latencies: Latencies that cause collisions are called forbidden latencies (e.g., in the above reservation table 2, 4, 5 and 7 are forbidden latencies). Latencies that do not cause any collision are called permissible latencies (e.g., in the above reservation table 1, 3 and 6 are permissible latencies).
A latency cycle is a latency sequence that repeats the same subsequence (cycle) indefinitely. The average latency of a latency cycle is obtained by dividing the sum of all latencies by the number of latencies along the cycle. A constant cycle is a latency cycle that contains only one latency value (e.g., cycles (3) and (6) are both constant cycles).
Collision Vector: The combined set of permissible and forbidden latencies can be displayed compactly as a collision vector, an m-bit binary vector C = (Cm Cm-1 ... C2 C1), where m <= n-1 for an n-column reservation table. Ci = 1 if latency i causes a collision and Ci = 0 if latency i is permissible. (E.g., for the above table, Cx = (1011010).)
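A small Python sketch (the table encoding is ours, matching the example above) that derives the forbidden latencies and the collision vector from a reservation table:

    # Reservation table: each stage maps to the set of clock cycles it is used in.
    table = {"S1": {1, 6, 8}, "S2": {2, 4}, "S3": {3, 5, 7}}
    n = 8                                   # number of columns

    # A latency L is forbidden if some stage is used twice, L cycles apart.
    forbidden = {b - a
                 for marks in table.values()
                 for a in marks for b in marks if b > a}
    print("forbidden latencies:", sorted(forbidden))     # [2, 4, 5, 7]

    # Collision vector C = (Cm ... C1) with m = n - 1.
    m = n - 1
    bits = "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))
    print("collision vector:", bits)                     # 1011010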
State Diagram: Specifies the permissible state transitions among successive initiations, based on the collision vector. Starting from the initial collision vector, each permissible latency leads to a new state obtained by shifting the current state right by that latency and ORing the result with the initial collision vector.
Simple Cycle, Greedy Cycle and MAL: A simple cycle is a latency cycle in which each state appears only once. In the state diagram for the example, only (3), (6), (8), (1, 8), (3, 8), and (6, 8) are simple cycles. The cycle (1, 8, 6, 8) is not simple, as it travels twice through the state (1011010).
A greedy cycle is a simple cycle whose edges are all made with the minimum latencies from their respective starting states. Here the cycles (1, 8) and (3) are greedy cycles.
MAL (Minimum Average Latency) is the minimum average latency obtained over the greedy cycles. Of the greedy cycles (1, 8), with average latency (1+8)/2 = 4.5, and (3), the cycle (3) gives the MAL value 3.
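The greedy strategy can be simulated directly. A minimal Python sketch, assuming the standard state-transition rule (next state = current state shifted right by the latency, ORed with the initial collision vector) and that any latency greater than m never collides:

    m = 7
    init = 0b1011010                       # collision vector from the example

    def permissible(state):
        """Latencies 1..m whose bit is 0 in the current state."""
        return [i for i in range(1, m + 1) if not (state >> (i - 1)) & 1]

    def next_state(state, latency):
        return (state >> latency) | init

    def greedy_walk(first):
        """Take `first` from the initial state, then always the smallest
        permissible latency; return the repeating latency cycle."""
        state, lat, cycle, seen = init, first, [], {}
        while (state, lat) not in seen:
            seen[(state, lat)] = len(cycle)
            cycle.append(lat)
            state = next_state(state, lat)
            lats = permissible(state)
            lat = lats[0] if lats else m + 1   # latencies > m (here 8) never collide
        return cycle[seen[(state, lat)]:]

    for first in permissible(init):            # [1, 3, 6]
        c = greedy_walk(first)
        print(first, "->", c, "average latency", sum(c) / len(c))
    # Starting with 1 yields the greedy cycle (1, 8), average 4.5;
    # starting with 3 yields the greedy cycle (3), average 3.0, the MAL.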
Instruction pipeline
An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments.
Instruction Execution Phases
Consider an instruction pipeline of 4 stages:
S1 : Fetch the instruction from memory (FI).
S2 : Decode the instruction (DA).
S3 : Fetch the operands from memory (FO).
S4 : Execute the instruction (EX).
Step/Instruction   1    2    3    4    5    6    7    8    9
I1                 FI   DA   FO   EX
I2                      FI   DA   FO   EX
I3                           FI   DA   FO   EX
I4                                FI   DA   FO   EX
I5                                     FI   DA   FO   EX
I6                                          FI   DA   FO   EX
Fig: Space-time diagram
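A small Python sketch (stage names FI, DA, FO, EX as above; the printing helper is our own) that reproduces this space-time diagram for any number of instructions:

    STAGES = ["FI", "DA", "FO", "EX"]      # fetch, decode, fetch operands, execute

    def space_time(n_instructions):
        """Print which stage each instruction occupies in every clock cycle."""
        cycles = n_instructions + len(STAGES) - 1
        print("Instr " + " ".join(f"{c:>3}" for c in range(1, cycles + 1)))
        for i in range(n_instructions):
            row = ["   "] * cycles
            for s, name in enumerate(STAGES):
                row[i + s] = f"{name:>3}"   # instruction i enters stage s at cycle i+s
            print(f"I{i + 1:<4} " + " ".join(row))

    space_time(6)   # reproduces the diagram above: 6 instructions, 9 cycles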
Superpipelining
Superpipelining is based on dividing the stages of a pipeline into substages, thus increasing the number of instructions that are in flight in the pipeline at a given moment. For example, if we divide each stage into two, the clock cycle period t is reduced to half, t/2; hence, at maximum capacity, the pipeline produces a result every t/2 seconds. For a given architecture and the corresponding instruction set there is an optimal number of pipeline stages; increasing the number of stages beyond this limit reduces overall performance. A solution for further improving speed is the superscalar architecture. Given a pipeline stage time T, it may be possible to execute at a higher rate by starting operations at intervals of T/n. This can be accomplished in two ways:
- Further divide each of the pipeline stages into n substages.
- Provide n pipelines that are overlapped.
The first approach requires faster logic and the ability to subdivide the stages into segments with uniform latency. It may also require more complex inter-stage interlocking and stall-restart logic.
The second approach could be viewed in a sense as staggered superscalar operation, and has
associated with it all of the same requirements except that instructions and data can be fetched
with a slight offset in time. In addition, inter-pipeline interlocking is more difficult to manage
because of the sub-clock period differences in timing between the pipelines.
Even so, staggered clock pipelines may be necessary with superscalar designs in the future, in
order to reduce peak power and corresponding power-supply induced noise. Alternatively,
designs may be forced to shift to a balanced mode of circuit operation in which logic transitions
are balanced by reverse transitions -- a technique used in the Cray supercomputers that resulted
in the computer presenting a pure DC load to the power supply, and greatly reduced noise in the
system.
Inevitably, superpipelining is limited by the speed of logic, and the frequency of unpredictable
branches. Stage time cannot productively grow shorter than the interstage latch time, and so this
is a limit for the number of stages.
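This limit can be illustrated with a simple, admittedly idealized, timing model: if the total logic delay T_logic is split evenly across k stages, each stage still pays a fixed interstage latch overhead T_latch, so the cycle time is T_logic/k + T_latch and the speedup saturates. A Python sketch with illustrative numbers (not from the text):

    def cycle_time(t_logic, t_latch, k):
        """Idealized stage time: logic split evenly over k stages, plus latch overhead."""
        return t_logic / k + t_latch

    t_logic, t_latch = 20.0, 0.5         # ns; illustrative values only
    for k in (1, 2, 5, 10, 20, 40, 80):
        speedup = cycle_time(t_logic, t_latch, 1) / cycle_time(t_logic, t_latch, k)
        print(k, round(cycle_time(t_logic, t_latch, k), 3), round(speedup, 2))
    # The speedup approaches (t_logic + t_latch) / t_latch = 41 but never reaches it:
    # stage time cannot usefully shrink below the interstage latch time.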
The MIPS R4000 is sometimes called a superpipelined machine, although its 8 stages really
only split the I-fetch and D-fetch stages of the pipe and add a Tag Check stage. Nonetheless, the
extra stages enable it to operate with higher throughput. The UltraSPARC's 9-stage pipe
definitely qualifies it as a superpipelined machine, and in fact it is a Super-Super design because
of its superscalar issue. The Pentium 4 splits the pipeline into 20 stages to enable increased
clock rate. The benefit of such extensive
pipelining is really only gained for very regular applications such as graphics. On more irregular
applications, there is little performance advantage.
Superscalar
Superscalar processing has its origins in the Cray-designed CDC supercomputers, in which
multiple functional units are kept busy by multiple instructions. The CDC machines could pack
as many as 4 instructions in a word at once, and these were fetched together and
dispatched via a pipeline. Given the technology of the time, this configuration was fast enough
to keep the functional units busy without outpacing the instruction memory.
Current technology has enabled, and at the same time created the need to issue instructions in
parallel. As execution pipelines have approached the limits of speed, parallel execution has been
required to improve performance. As this requires greater fetch rates from memory, which hasn't
accelerated comparably, it has become necessary to fetch instructions in parallel
-- fetching serially and pipelining their dispatch can no longer keep multiple functional units
busy. At the same time, the movement of the L1 instruction cache onto the chip has permitted
designers to fetch a cache line in parallel with little cost.
In some cases superscalar machines still employ a single fetch-decode-dispatch pipe that drives
all of the units. For example, the UltraSPARC splits execution after the third stage of a unified
pipeline. However, it is becoming more common to have multiple fetch-decode- dispatch pipes
feeding the functional units.
The choice of approach depends on tradeoffs of the average execute time vs. the speed with
which instructions can be issued. For example, if execution averages several cycles, and the
number of functional units is small, then a single pipe may be able to keep the units utilized.
When the number of functional units grows large and/or their execution time approaches the
issue time, then multiple issue pipes may be necessary.
Having multiple issue pipes requires some means of ordering the instructions sent to the different pipes. Reordering may be either static (performed by the compiler) or dynamic (using hardware lookahead). It can be difficult to combine the two approaches, because the compiler may not be able to predict the actions of the hardware reordering mechanism.
Superscalar operation is limited by the number of independent operations that can be extracted from an instruction stream. Early studies on simpler processor models showed that this is limited, mostly by branches, to a small number (<10, typically about 4).
More recent work has shown that, with speculative execution and aggressive branch prediction,
higher levels may be achievable. On certain highly regular codes, the level of parallelism may
be quite high (around 50). Of course, such highly regular codes are just as amenable to other
forms of parallel processing that can be employed more directly, and are also the exception
rather than the rule. Current thinking is that about 6-way instruction level parallelism for a
typical program mix may be the natural limit, with 4-way being likely for integer codes.
Potential ILP may be three times this, but it will be very difficult to exploit even a majority of
this parallelism. Nonetheless, obtaining a factor of 4 to 6 boost in performance is quite
significant, especially as processor speeds approach their limits.
Going beyond a single instruction stream and allowing multiple tasks (or threads) to operate at
the same time can enable greater system throughput. Because these are naturally independent at
the fine-grained level, we can select instructions from different streams to fill pipeline slots that
would otherwise go vacant in the case of issuing from a single thread. In turn, this makes it
useful to add more functional units. We shall further explore these multithreaded architectures
later in the course.
Scoreboard
A scoreboard is a centralized table that keeps track of the instructions to be performed and the
available resources and issues the instructions to the functional units when everything is ready
for them to proceed. As the instructions execute, dependences are checked and execution is
stalled as necessary to ensure that in-order semantics are preserved. Out of order execution is
possible, but is limited by the size of the scoreboard and the execution rules. The scoreboard can
be thought of as preceding dispatch, but it also controls execution after the issue. In a
scoreboarded system, the results can be forwarded directly to their destination register (as long
as there are no write after read hazards, in which case their execution is stalled), rather than
having to proceed to a final write-back stage.
In the CDC scoreboard, each register has a matching Result Register Designator that indicates
which functional unit will write a result into it. The fact that only one functional unit can be
designated for writing to a register at a time ensures that WAW dependences cannot occur. Each
functional unit also has a corresponding set of Entry-Operand Register Designators that indicate
what register will hold each operand, whether the value is valid (or pending) and if it is pending,
what functional unit will produce it (to facilitate forwarding). None of the operands is released
to a functional unit until they are all valid, precluding RAW dependences. In addition, the scoreboard stalls any functional unit whose result would write a register that is still listed as an Entry-Operand to a functional unit that is waiting for an operand or is busy, thus avoiding WAR violations. An instruction is only allowed to issue if its specified functional unit is free and its result register is not reserved by another functional unit that has not yet completed.
Four Stages of Scoreboard Control
1. Issue—decode instructions and check for structural hazards (ID1). If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or WAW hazard exists, the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
2. Read operands—wait until no data hazards, then read operands (ID2). A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.
3. Execution—operate on operands (EX). The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
4. Write result—finish execution (WB). Once the scoreboard is aware that the functional unit has completed execution, it checks for WAR hazards. If there are none, it writes the results; if a WAR hazard exists, it stalls the instruction.
Example:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
The CDC 6600 scoreboard would stall SUBD's write of F8 until ADDD has read its operands, since this is a WAR hazard on F8.
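To make the hazard classes concrete, here is a minimal Python sketch (our own simplification, not the CDC 6600 logic) that scans an instruction sequence and reports RAW, WAR, and WAW register hazards between earlier and later instructions:

    # Each instruction: (name, destination register, source registers).
    program = [
        ("DIVD", "F0",  ("F2", "F4")),
        ("ADDD", "F10", ("F0", "F8")),   # RAW on F0 (needs DIVD's result)
        ("SUBD", "F8",  ("F8", "F14")),  # WAR on F8 (ADDD still reads F8)
    ]

    for i, (op_i, dst_i, srcs_i) in enumerate(program):
        for op_j, dst_j, srcs_j in program[i + 1:]:
            if dst_i in srcs_j:
                print(f"RAW: {op_j} reads {dst_i} written by {op_i}")
            if dst_j in srcs_i:
                print(f"WAR: {op_j} writes {dst_j} read by {op_i}")
            if dst_j == dst_i:
                print(f"WAW: {op_j} writes {dst_i} also written by {op_i}")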
Three Parts of the Scoreboard
1. Instruction status—indicates which of the 4 steps the instruction is in.
2. Functional unit status—indicates the state of the functional unit (FU), with 9 fields per functional unit:
Busy—indicates whether the unit is busy or not
Op—operation to perform in the unit (e.g., + or –)
Fi—destination register
Fj, Fk—source-register numbers
Qj, Qk—functional units producing source registers Fj, Fk
Rj, Rk—flags indicating when Fj, Fk are ready and not yet read; set to No after the operands are read
3. Register result status—indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register.
Scoreboard Implications
• Provides a solution for WAR and WAW hazards.
• Solution for WAR: stall the write in WB to allow reads to take place; read registers only during the Read Operands stage.
• For WAW, the hazard must be detected: stall in the Issue stage until the other instruction completes.
• Multiple instructions must be in the execution phase at once.
• The scoreboard keeps track of dependencies and the state of operations:
– It monitors every change in the hardware.
– It determines when to read operands, when an instruction can execute, and when it can write back.
– Hazard detection and resolution are centralized.
Reservation Stations
The reservation station approach releases instructions directly to a pool of buffers associated with their intended functional units (if more than one unit of a particular type is present, the units may share a single station). The reservation stations are a distributed resource, rather than being centralized, and can be thought of as following dispatch. A reservation is a record consisting of an instruction and its requirements to execute: its operands as specified by their sources and destination, and bits indicating when valid values are available for the sources. The instruction is released to the functional unit when its requirements are satisfied, but it is important to note that satisfaction does not require an operand to actually be in a register: it can be forwarded to the reservation station for immediate release, or buffered (see below) for later release. Thus, the reservation station's influence on execution can be thought of as more implicit and data dependent than the explicit control exercised by the scoreboard.
Tomasulo Algorithm
The Tomasulo algorithm is the hardware dependence-resolution technique used in the IBM 360/91, introduced about 3 years after the CDC 6600.
Three Stages of the Tomasulo Algorithm
1. Issue—get an instruction from the FP operation queue. If a reservation station is free, issue the instruction and send the operands (renaming registers).
2. Execution—operate on operands (EX). When both operands are ready, execute; if not ready, watch the Common Data Bus (CDB) for the result.
3. Write result—finish execution (WB). Write the result on the Common Data Bus to all awaiting units and mark the reservation station available.
Here the storage of operands resulting from instructions that completed out of order is done
through renaming of the registers. There are two mechanisms commonly used for renaming.
One is to assign physical registers from a free pool to the logical registers as they are identified
in an instruction stream. A lookup table is then used to map the logical register references to
their physical assignments. Usually the pool is larger than the logical register set to allow for
temporary buffering of results that are computed but not yet ready to write back. Thus, the
processor must keep track of a larger set of register names than the instruction set architecture
specifies. When the pool is empty, instruction issue stalls.
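A minimal Python sketch of this first mechanism (free-pool renaming through a lookup table; the register names and pool size are illustrative, not from the text):

    free_pool = [f"P{i}" for i in range(8)]      # physical registers (illustrative)
    rename_map = {}                              # logical -> physical lookup table

    def rename(dst, srcs):
        """Map sources through the table, then give dst a fresh physical register."""
        phys_srcs = [rename_map.get(s, s) for s in srcs]
        if not free_pool:
            raise RuntimeError("pool empty: instruction issue stalls")
        phys_dst = free_pool.pop(0)
        rename_map[dst] = phys_dst
        return phys_dst, phys_srcs

    # WAR example from the scoreboard section: SUBD writes F8 while ADDD reads it.
    print(rename("F0",  ["F2", "F4"]))    # DIVD -> ('P0', ['F2', 'F4'])
    print(rename("F10", ["F0", "F8"]))    # ADDD -> ('P1', ['P0', 'F8'])
    print(rename("F8",  ["F8", "F14"]))   # SUBD -> ('P2', ['F8', 'F14'])
    # SUBD's result goes to P2, so ADDD's pending read of the old F8 is
    # unaffected: the WAR hazard disappears.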
The other mechanism is to keep the traditional association of logical and physical registers, but
then provide additional buffers either associated with the reservation stations or kept in a central
location. In either case, each of these "reorder buffers" is associated with a given instruction,
and its contents (once computed) can be used in forwarding operations as long as the instruction
has not completed.
When an instruction reaches the point that it may complete in a manner that preserves sequential
semantics, then its reservation station is freed and its result appears in the logical register that
was originally specified. This is done either by renaming the temporary register to be one of the
logical registers, or by transferring the contents of the reorder buffer to the appropriate physical
register.
Out-of-Order Issue
To enable out-of-order dispatch of instructions to the pipelines, we must provide at least two
reservation stations per pipe that are available for issue at once. An alternative would
be to rearrange instructions in the prefetch buffer, but without knowing the status of the pipes, it
would be difficult to make such a reordering effective. By providing multiple reservation
stations, however, we can continue issuing instructions to pipes, even though an instruction may
be stalled while awaiting resources. Then, whatever instruction is ready first can enter the pipe
and execute. At the far end of the pipeline, the out-of-order instruction must wait to be retired in
the proper order. This necessitates a mechanism for keeping track of the proper order of
instructions (note that dependences alone cannot guarantee that instructions will be properly
reordered when they complete).
Superscalar-Superpipeline
Of course we may also combine superscalar operation with superpipelining. The result is
potentially the product of the speedup factors.
However, it is even more difficult to interlock between parallel pipes that are divided into many
stages. Also, the memory subsystem must be able to sustain a level of instruction throughput
corresponding to the total throughput of the multiple pipelines -- stretching the
processor/memory performance gap even more. Of course, with so many pipes and so many
stages, branch penalties become huge, and branch prediction becomes a serious bottleneck
(offsetting this somewhat is the potential for speculative branch execution that arises with a
large number of pipes).
But the real problem may be in finding the parallelism required to keep all of the pipes and stages busy between branches. Consider that a machine with 12 pipelines of 20 stages must always have access to a window of 240 instructions scheduled so as to avoid all hazards, and requires that the roughly 40 branches present in a block of that size (at a typical rate of one branch per six instructions) all be correctly predicted sufficiently far in advance to avoid stalling the prefetch unit. This is a tall order, even for highly regular code with a good balance of floating-point to integer operations, let alone irregular code that is typically unbalanced in its operation mix. As usual, scientific code and image processing with their regular array operations often provide the best performance for a super-super processor. However, a vector unit could more directly address the array-processing problems, and what is left to the scalar unit may not benefit nearly so greatly from all of this complexity. Scalar processing can certainly benefit from ILP in many cases, but probably in patterns that differ from vector processing.
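The window arithmetic in this example is straightforward; a tiny sketch (the one-in-six branch frequency is the assumption implied by the numbers above):

    pipes, stages = 12, 20
    branch_rate = 1 / 6                  # assumed: one branch per six instructions

    window = pipes * stages              # instructions that must be in flight
    branches = window * branch_rate      # branches to predict inside the window
    print(window, branches)              # 240 instructions, 40.0 branches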