Techopedia Explains: Amdahl's Law
Techopedia Explains: Amdahl's Law
Techopedia Explains: Amdahl's Law
Advantages:
Global address space provides a user-friendly programming perspective to memory
Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs
Disadvantages:
Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can
geometrically increases traffic on the shared memory-CPU path, and for cache coherent systems,
geometrically increase traffic associated with cache/memory management.
Programmer responsibility for synchronization constructs that ensure "correct" access of global
memory.
Expense: it becomes increasingly difficult and expensive to design and produce shared memory
machines with ever increasing numbers of processors.
Speedup is limited by the total time needed for the sequential (serial) part of the
program. For 10 hours of computing, if we can parallelize 9 hours of computing and 1
hour cannot be parallelized, then our maximum speedup is limited to 10x.
The control of pipeline processors has similar issues to the control of multicycle
datapaths. Pipelining leaves the meaning of the nine control lines unchanged, that is,
those lines which controlled the multicycle datapath. In pipelining, we set control
lines (to defined values) in each stage for each instruction. This is done in hardware
by extending pipeline registers to include control information and circuitry.
Observe that there is nothing to control during instruction fetch and decode (IF and
ID). Thus, we can begin our control activities (initialization of control signals) during
ID, since control will only be exerted during EX, MEM, and WB stages of the
pipeline. Recalling that the various stages of control and buffer circuitry between the
pipeline stages are labelled IF/ID, ID/EX, EX/MEM, and MEM/WB, we have the
propagation of control shown in Figure 5.5.
Figure 5.5. Propagation of control through the EX, MEM, and WB states of the MIPS
pipelined datapath [Maf01,MK98].
IF/ID: Initializes control by passing the rs, rd, and rt fields of the instruction,
together with the opcode and funct fields, to the control circuitry.
ID/EX: Buffers control for the EX, MEM, and WB stages, while executing
control for the EX stage. Control decides what operands will be input to the
ALU, what ALU operation will be performed, and whether or not a branch is to
be taken based on the ALU Zero output.
EX/MEM: Buffers control for the MEM and WB stages, while executing
control for the MEM stage. The control lines are set for memory read or write,
as well as for data selection for memory write. This stage of control also
contains the branch control logic.
MEM/WB: Buffers and executes control for the WB stage, and selects the value
to be written into the register file.
Figure 5.6 shows how the control lines (red) are arranged on a per-stage basis, and
how the stage-specific control signals are buffered and passed along to the next
applicable stage.
Figure 5.6. Propagation of control through the EX, MEM, and WB states of the MIPS
pipelined datapath [Maf01,MK98].
Reading Assigment: Study the propagation of control signals for the example program
given on p. 471 of the textbook, which is illustrated stepwise on pp. 472-476 of the
textbook.
Pipeline Hazards
Pipeline hazards are situations that prevent the next instruction in the instruction stream
from executing during its designated clock cycles. Any condition that causes a stall in
the pipeline operations can be called a hazard. There are primarily three types of
hazards:
i. Data Hazards
ii. Control Hazards or instruction Hazards
iii. Structural Hazards.
Data Hazards
A data hazard is any condition in which either the source or the destination operands of an
instruction are not available at the time expected in the pipeline. As a result of which some
operation has to be delayed and the pipeline stalls. There are mainly three types of data
hazards:
RAW (Read after Write) – RAW hazard appears when the next instruction tries to read
from a source before the previous instruction writes to it.
B tries to read a register before A has written it and gets the old
value.
WAR (Write after Read) - WAR hazard appears when the next instruction writes to a
destination before the previous instruction reads it.
WAW (Write after Write) - WAW hazard appears when the next instruction tries to write
to a destination before the previous instruction writes to it and it results in changes done
in the wrong order.
Control hazards:
The instruction fetch unit of the CPU is responsible for providing a stream of instructions
to the execution unit. The instructions fetched by the fetch unit are in consecutive
memory locations and they are executed.
However the problem arises when one of the instructions is a branching instruction to
some other memory location. Thus all the instruction fetched in the pipeline from
consecutive memory locations are invalid now and need to removed(also called flushing
of the pipeline).This induces a stall till new instructions are again fetched from the
memory address specified in the branch instruction.
Thus the time lost as a result of this is called a branch penalty. Often dedicated
hardware is incorporated in the fetch unit to identify branch instructions and compute
branch addresses as soon as possible and reducing the resulting delay as a result.
Structural Hazards occur when different instructions collide while trying to access
the same piece of hardware in the same segment of a pipeline. This type of hazard
can be alleviated by having redundant hardware for the segments wherein the
collision occurs. Occasionally, it is possible to insert stalls or reorder instructions to
omit this type of hazard.
Structural Hazards:
This situation arises mainly when two instructions require a given hardware resource at
the same time and hence for one of the instructions the pipeline needs to be stalled.
The most common case is when memory is accessed at the same time by two
instructions. One instruction may need to access the memory as part of the Execute or
Write back phase while other instruction is being fetched. In this case if both the
instructions and data reside in the same memory. Both the instructions can’t proceed
together and one of them needs to be stalled till the other is done with the memory
access part. Thus in general sufficient hardware resources are needed for avoiding
structural hazards.
We next examine hazards in detail, and discuss several techniques for eliminating or
relieving hazards.
Definition. A data hazard occurs when the current instruction requires the result of a
preceding instruction, but there are insufficient segments in the pipeline to compute
the result and write it back to the register file in time for the current instruction to read
that result from the register file.
Problem: The first instruction (sub), starting on clock cycle 1 (CC1) completes
on CC5, when the result in Register 2 is written to the register file. If we did
nothing to resolve data dependencies, then no instruction that read Register 2
from the register file could read the "new" value computed by the sub
instruction until CC5. The dependencies in the other instructions are illustrated
by solid lines with arrowheads. If register read and write cannot occur within
the same clock cycle (we will see how this could happen in Section 5.3.4), then
only the fifth instruction (sw) can access the contents of register 2 in the
manner indicated by the flow of sequential execution in the MIPS code
fragment shown previously.
For example, consider the data dependency between the first and fourth
instructions (sub and add) of the example in Section 5.3.3. Here, a register file
write and a register file read are scheduled in CC5. This can be resolved by (a)
duplicating hardware, or (b) modifying the existing hardware to support
concurrent operations. If we duplicated the register file, then we could perform
concurrent read and write operations, but there would be a consistency
problem. That is, at a given clock cycle, registers in one register file could have
different values than the corresponding registers in the other register file. This
inconsistency is clearly unacceptable if accurate computation is to be
maintained.
Instead, we can modify the register file so that it (1) performs register write on
the first half of the clock cycle and (2) performs register read on the second half
of the clock cycle. In earlier hardware, designers sometimes inserted a delay
between write and read that was very small in relation to the clock cycle time,
in order to ensure convergence of the register file write.
Other structural hazards could occur during the branch instruction, if there were
not two ALUs in the EX segment of the pipeline. That is, with only one ALU,
we would have to simultaneously compute the BTA and determine (via
subtraction) whether or not the branch condition was fulfilled. This would not
be possible without two concurrent adders in the ALU, which is what we
currently have in our MIPS pipeline design shown in Figure 5.4.
A further structural hazard could occur if we only used one memory for both
instructions and data. For example, in Figure 5.7, suppose the sub instruction
was instead a sw instruction. Then, we would be writing to data memory in CC4
for Instruction #1 and reading from instruction memory in CC4 for Instruction
#4. Clearly, if there was only one memory there would be a conflict.
Similar to the problem with the concurrent reads and writes on the register file,
there are two ways to solve this dilemma. First, we can design a special-
purpose memory module that permits writing (reading) on the first (resp.
second) half of the clock cycle, as we said could be done with the register file.
However, this requires special (expensive) hardware. Second, we can use two
fast caches, one for instructions, and one for data, that access a large, slower
main memory in which instructions and data are both stored. The latter method
is used in practice because caches and main memory already exist, and the
memory management hardware for these types of components also exists.
Thus, we can use off-the-shelf hardware to solve a problem that would
otherwise require special-purpose development of expensive hardware.
Although this might not be as much fun as developing new hardware, it is more
cost-effective, which matters when one is designing and producing computers
for profit.
Control hazards are the most difficult types of hazards arising from normal
operation of a program. In the next section, we will see that exceptions (e.g.,
overflow) can play particularly interesting types of havoc with smooth pipeline
execution.
The most common type of control hazard is the branch instruction, which has
two alternative results: (1) jump to the branch target address if the instruction
succeeds, or (2) execute the instruction after the branch (at PC+4 of instruction
memory) if the branch fails.
The problem with the branch instruction is that we usually do not know which
result will occur (i.e., whether or not the branch will be taken) until the branch
condition is computed. Often, the branch condition depends on the result of the
preceding instruction, so we cannot precompute the branch condition to find
out whether or not the branch will be taken.
The most advantageous situation is one where the branch condition does not
depend on instructions immemdiately preceding it, as shown in the following
code fragment:
add $5, $5, $6 # One of the registers for beq comparison is
modified
sub $4, $3, $6 # Nothing important to the branch here
and $7, $8, $6 # Nothing important to the branch here
and $9, $6, $6 # Nothing important to the branch here
beq $5, $6, target
Here, the branch compares Registers 5 and 6, which are last modified in
the add instruction. We can therefore precompute the branch condition as sub r
$5, $6, where r denotes a destination register. If r = 0, then we know the branch
will be taken, and the runtime module (pipeline loader) can schedule the jump
to the branch target address with full confidence that the branch will be taken.
Thus, it makes good sense to assume that a branch will jump to the place that it
jumped to before. However, in dense decision structures (e.g., nested or
cascaded if statements), this situation does not always occur. In such cases, one
might not be able to tell from the preceding branch whether or not the
branching behavior will be repeated. It is then reasonable to use a multi-branch
lookahead.
It is this instruction (I) that is called the branch delay slot (BDS). In the BDS
can be safely placed an instruction that does not have data dependencies with
respect to (a) the branch condition, (b) the instruction following the branch
condition, or (c) the branch target. This ensures that, when the instruction J is
executed (J is the instruction to which control is transferred after the branch
condition is evaluated, whether J is pointed to by PC+4 or BTA), then the
instruction I will have been executed previously, and the pipe will not have a
stall where I would have been. As a result, the pipe will remain full throughout
the branch evaluation and execution, unless an exception occurs.
1. Hardware detects an exception (e.g., overflow in the ALU) and stops the
offending instruction at the EX stage.
2. Pipeline loader and scheduler allow all prior instructions (e.g., those
already in the pipeline in MEM and WB) to complete.
3. All instructions that are present in the pipeline after the exception is
detected are flushed from the pipeline.
4. The address of the offending instruction (usually the address in main
memory) is saved in the EPC register, and a code describing the
exception is saved in the Cause register.
5. Hardware control branches to the exception handling routine (part of the
operating system).
6. The exception handler performs one of three actions: (1) notify the user
of the exception (e.g., divide-by-zero or arithmetic-overflow) then
terminate the program; (2) try to correct or mitigate the exception then
restart the offending instruction; or (3) if the exception is a benign
interrupt (e.g., an I/O request), then save the program/pipeline state,
service the interrupt request, then restart the program at the instruction
pointed to by EPC + 4.
In general, we can say that, if a pipeline has N segments, and the EX stage is at
segment 1 < i < N, then two observations are key to the prediction of pipeline
performance:
It is readily seen that the total number of wasted cycles equals (i-1) + (N-i) = N
- 1, which is precisely the number of cycles that it takes to set up or reload the
pipeline.
If Step 3 can be performed as stated, then the best-case penalty is only one
cycle, plus the time incurred by executing the exception handler. If the entire
pipeline needs to be flushed and restarted, then the worst-case penalty is N
cycles incurred by flushing the pipe, then reloading the pipeline after the
instructions preceding the offending instruction have been executed. If the
offending instruction must be restarted, then a maximum of i cycles are lost
(one cycle for flush, plus (i-1) cycles to restart the instructions in the pipe
following the offending instruction).
In the next section, we collect the concepts about pipeline performance that we
have been discussing, and show how to compute the CPI for a pipeline
processor under constraint of stalls, structural hazards, branch penalties, and
exception penalties.
Reservation Table: Displays the time space flow of data through the pipeline for one
function evaluation.
Time
1 2 3 4 5 6 7 8
X X X
X X
X X X
S1
Stage S2
S3
Reservation function for a function x
Latency:The number of time units (clock cycles) between two initiations of a pipeline is the latency
between them. Latency values must be non-negative integers.
Collision:When two or more initiations are done at same pipeline stage at the same time will cause a
collision. A collision implies resource conflicts between two initiations in the pipeline, so it should be
avoided.
Forbidden Latency : Latencies that cause collisions are called forbidden latencies. (E.g. in above
reservation table 2, 4, 5 and 7 are forbidden latencies).
Permissible Latency : Latencies that do not cause any collision are called permissible latencies.
(E.g. in above reservation table 1, 3 and 6 are permissible latencies).
Latency Sequence and Latency Cycle: A Latency Sequence is a sequence of permissible non-
forbidden latencies between successive task initiations.
A Latency cycle is a latency sequence which repeats the same subsequence (cycle)
indefinitely.
The Average Latency of a latency cycle is obtained by dividing the sum of all latencies by the
number of latencies along the cycle.
A Constant Cycle is a latency cycle which contains only one latency value. (E.g. Cycles (3) and
(6) both are constant cycle).
Collision Vector or Initial Collision Vector : The combined set of permissible and forbidden
latencies can be easily displayed by a collision vector, which is an m-bit (m<=n-1 in a n column
reservation table) binary vector C=(CmCm-1….C2C1). The value of Ci=1 if latency I causes a collision
and Ci=0 if latency i is permissible. (E.g. Cx= (1011010)).
State Diagram : Specifies the permissible state transitions among successive initiations based
on the collision vector.
The minimum latency edges in the state diagram are marked with asterisk.
Simple Cycle, Greedy Cycle and MAL: A Simple Cycle is a latency cycle in which each state
appears only once. In above state diagram only (3), (6), (8), (1, 8), (3, 8), and (6, 8) are simple
cycles. The cycle(1, 8, 6, 8) is not simple as it travels twice through state (1011010).
A Greedy Cycle is a simple cycle whose edges are all made with minimum latencies from their
respective starting states. The cycle (1, 8) and (3) are greedy cycles.
MAL (Minimum Average Latency) is the minimum average latency obtained from the greedy
cycle. In greedy cycles (1, 8) and (3), the cycle (3) leads to MAL value 3.
State Diagram
1) Start with the ICV (Initial collision vector).
2) For each unprocessed state, For each bit i in the CVi which is 0, do the following:
i) Shift CVi right by i bits
ii) i find from rightmost bits.
iii) Append zeros to left
iv) Logically OR with ICV
v) Performed shift operation on last pass of maximum permissible latency either this place 0
present or not.
vi) If step(d) results in a new state then form a new node for this state and join it with node
of CVi by an arc with a marking i.
vii) This shifting process needs to continue until no more new states can be generated.
Greedy Cycle: A simple cycle is a greedy cycle if each latency contained in a cycle is the minimal
latency(outgoing arc) from a state in the cycle.Complex Cycle: consists of more than one simple
cycle in it.
1.
The state with all zeros has a self-loop which corresponds to empty pipeline and it is possible to
wait for indefinite number of latency cycles of the form (7),(8), (9),(10) etc. Method of finding
Cycle
2. 18. Simple cycles & The Greedy cycle is (1,3,3). Minimum Average Latency(MAL):
Scheduling strategy that achieve maximum Throughput without collision. In the above
example, the cycle that offers MAL is (1, 3, 3) (MAL = (1+3+3)/3 =2.333)The simple cycles
are (3),(5) ,(1,3,3),(4,3) and (4) Greedy cycles