Techopedia Explains: Amdahl's Law

Shared memory parallel computers vary widely, but generally have in common the ability for all
processors to access all memory as global address space.

Multiple processors can operate independently but share the same memory resources.
Changes in a memory location effected by one processor are visible to all other processors.
Shared memory machines can be divided into two main classes based upon memory access times: UMA
and NUMA.
Advantages:
Global address space provides a user-friendly programming perspective to memory
Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs
Disadvantages:
Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can
geometrically increases traffic on the shared memory-CPU path, and for cache coherent systems,
geometrically increase traffic associated with cache/memory management.
Programmer responsibility for synchronization constructs that ensure "correct" access of global
memory.
Expense: it becomes increasingly difficult and expensive to design and produce shared memory
machines with ever increasing numbers of processors.
Amdahl’s law is a formula used to find the maximum improvement improvement

possible by improving a particular part of a system. In parallel computing, Amdahl's law
is mainly used to predict the theoretical maximum speedup for program processing
using multiple processors. It is named after Gene Amdahl, a computer architect from
IBM and the Amdahl Corporation.
This term is also known as Amdahl’s argument.
[WEBINAR] How to Optimize Use of the Hybrid Cloud
Techopedia explains Amdahl's Law

Amdahl’s law states that in parallelization, if P is the proportion of a system or program
that can be made parallel, and 1-P is the proportion that remains serial, then the
maximum speedup that can be achieved using N number of processors is 1/((1-
P)+(P/N).
If N tends to infinity then the maximum speedup tends to 1/(1-P).
Speedup is limited by the total time needed for the sequential (serial) part of the
program. For 10 hours of computing, if we can parallelize 9 hours of computing and 1
hour cannot be parallelized, then our maximum speedup is limited to 10x.
Shared Memory Pros and Cons

• Pros
– Communication happens automatically
– More natural way of programming
• Easier to write correct programs and gradually optimize them
– No need to manually distribute data
(but can help if you do)
• Cons
– Needs more hardware support
– Easy to write correct, but inefficient programs
(remote accesses look the same as local ones)
Pipeline Control and Hazards
Reading Assignments and Exercises
The control of pipeline processors has similar issues to the control of multicycle
datapaths. Pipelining leaves the meaning of the nine control lines unchanged, that is,
those lines which controlled the multicycle datapath. In pipelining, we set control
lines (to defined values) in each stage for each instruction. This is done in hardware
by extending pipeline registers to include control information and circuitry.
5.3.1. Pipeline Control Issues and Hardware
Observe that there is nothing to control during instruction fetch and decode (IF and
ID). Thus, we can begin our control activities (initialization of control signals) during
ID, since control will only be exerted during EX, MEM, and WB stages of the
pipeline. Recalling that the various stages of control and buffer circuitry between the
pipeline stages are labelled IF/ID, ID/EX, EX/MEM, and MEM/WB, we have the
propagation of control shown in Figure 5.5.
Figure 5.5. Propagation of control through the EX, MEM, and WB states of the MIPS
pipelined datapath [Maf01,MK98].
Here, the following stages perform work as specified:
 IF/ID: Initializes control by passing the rs, rd, and rt fields of the instruction,
together with the opcode and funct fields, to the control circuitry.
 ID/EX: Buffers control for the EX, MEM, and WB stages, while executing
control for the EX stage. Control decides what operands will be input to the
ALU, what ALU operation will be performed, and whether or not a branch is to
be taken based on the ALU Zero output.
 EX/MEM: Buffers control for the MEM and WB stages, while executing
control for the MEM stage. The control lines are set for memory read or write,
as well as for data selection for memory write. This stage of control also
contains the branch control logic.
 MEM/WB: Buffers and executes control for the WB stage, and selects the value
to be written into the register file.
Figure 5.6 shows how the control lines (red) are arranged on a per-stage basis, and
how the stage-specific control signals are buffered and passed along to the next
applicable stage.
Figure 5.6. Propagation of control through the EX, MEM, and WB states of the MIPS
pipelined datapath [Maf01,MK98].
Reading Assigment: Study the propagation of control signals for the example program
given on p. 471 of the textbook, which is illustrated stepwise on pp. 472-476 of the
textbook.
Pipeline Hazards
Pipeline hazards are situations that prevent the next instruction in the instruction stream
from executing during its designated clock cycles. Any condition that causes a stall in
the pipeline operations can be called a hazard. There are primarily three types of
hazards:
i. Data Hazards
ii. Control Hazards or instruction Hazards
iii. Structural Hazards.
Data Hazards
A data hazard is any condition in which either the source or the destination operands of an
instruction are not available at the time expected in the pipeline. As a result of which some
operation has to be delayed and the pipeline stalls. There are mainly three types of data
hazards:
RAW (Read after Write) – RAW hazard appears when the next instruction tries to read
from a source before the previous instruction writes to it.
B tries to read a register before A has written it and gets the old
value.
WAR (Write after Read) - WAR hazard appears when the next instruction writes to a
destination before the previous instruction reads it.
B tries to write a register before A has read it.

In this case, A uses the new (incorrect) value.
WAW (Write after Write) - WAW hazard appears when the next instruction tries to write
to a destination before the previous instruction writes to it and it results in changes done
in the wrong order.
B tries to write an operand before A has written it.

After instruction B has executed, the value of the register should
be B's result, but A's result is stored instead.
This can only happen with pipelines that write values in more
than one stage, or in variable-length pipelines
Control Hazards can result from branch instructions. Here, the branch target address
might not be ready in time for the branch to be taken, which results in stalls (dead
segments) in the pipeline that have to be inserted as local wait events, until processing
can resume after the branch target is executed. Control hazards can be mitigated
through accurate branch prediction (which is difficult), and by delayed
branch strategies.
Control hazards:
The instruction fetch unit of the CPU is responsible for providing a stream of instructions
to the execution unit. The instructions fetched by the fetch unit are in consecutive
memory locations and they are executed.
However the problem arises when one of the instructions is a branching instruction to
some other memory location. Thus all the instruction fetched in the pipeline from
consecutive memory locations are invalid now and need to removed(also called flushing
of the pipeline).This induces a stall till new instructions are again fetched from the
memory address specified in the branch instruction.
Thus the time lost as a result of this is called a branch penalty. Often dedicated
hardware is incorporated in the fetch unit to identify branch instructions and compute
branch addresses as soon as possible and reducing the resulting delay as a result.
Structural Hazards occur when different instructions collide while trying to access
the same piece of hardware in the same segment of a pipeline. This type of hazard
can be alleviated by having redundant hardware for the segments wherein the
collision occurs. Occasionally, it is possible to insert stalls or reorder instructions to
omit this type of hazard.
Structural Hazards:
This situation arises mainly when two instructions require a given hardware resource at
the same time and hence for one of the instructions the pipeline needs to be stalled.
The most common case is when memory is accessed at the same time by two
instructions. One instruction may need to access the memory as part of the Execute or
Write back phase while other instruction is being fetched. In this case if both the
instructions and data reside in the same memory. Both the instructions can’t proceed
together and one of them needs to be stalled till the other is done with the memory
access part. Thus in general sufficient hardware resources are needed for avoiding
structural hazards.
We next examine hazards in detail, and discuss several techniques for eliminating or
relieving hazards.
5.3.3. Data Hazards
Definition. A data hazard occurs when the current instruction requires the result of a
preceding instruction, but there are insufficient segments in the pipeline to compute
the result and write it back to the register file in time for the current instruction to read
that result from the register file.
We typically remedy this problem in one of three ways:
 Forwarding: In order to resolve a dependency, one adds special circuitry to the

pipeline that is comprised of wires and switches with which one forwards or
transmits the desired value to the pipeline segment that needs that value for
computation. Although this adds hardware and control circuitry, the method
works because it takes far less time for the required value(s) to travel through a
wire than it does for a pipeline segment to compute its result.
 Code Re-Ordering: Here, the compiler reorders statements in the source code,
or the assembler reorders object code, to place one or more statements between
the current instruction and the instruction in which the required operand was
computed as a result. This requires an "intelligent" compiler or assembler,
which must have detailed information about the structure and timing of the
pipeline on which the data hazard would occur. We call this type of software
a hardware-dependent compiler.
 Stall Insertion: It is possible to insert one or more stalls (no-op instructions)
into the pipeline, which delays the execution of the current instruction until the
required operand is written to the register file. This decreases pipeline
efficiency and throughput, which is contrary to the goals of pipeline processor
design. Stalls are an expedient method of last resort that can be used when
compiler action or forwarding fails or might not be supported in hardware or
software design.
The following example is illustrative.
Example. Suppose we have the following sequence of instructions:

sub $2, $1, $3 # Register 2 is the output of sub
and $8, $2, $5 # Operand #1 depends on Register 2 data
or $9, $6, $2 # Operand #2 depends on Register 2 data
add $7, $2, $2 # Add result depends on Register 2 data
sw $6,20($2) # Store (memory write) depends on Register
2
whose pipeline scheduling diagram is shown in Figure 5.7.
Figure 5.7. Example of data hazards in a sequence of MIPS instructions, where

the red (blue) arrows indicate dependencies that are problematic (not
problematic) [Pat98,MK98].
Problem: The first instruction (sub), starting on clock cycle 1 (CC1) completes
on CC5, when the result in Register 2 is written to the register file. If we did
nothing to resolve data dependencies, then no instruction that read Register 2
from the register file could read the "new" value computed by the sub
instruction until CC5. The dependencies in the other instructions are illustrated
by solid lines with arrowheads. If register read and write cannot occur within
the same clock cycle (we will see how this could happen in Section 5.3.4), then
only the fifth instruction (sw) can access the contents of register 2 in the
manner indicated by the flow of sequential execution in the MIPS code
fragment shown previously.
Solution #1 - Forwarding: The result generated by the sub instruction can

be forwarded to the other stages of the pipeline using special control circuitry
(data bus switchable to any other segment, which can be implemented via a
decoder or crossbar switch). This is indicated notionally in Figure 5.7 by solid
red lines with arrowheads. If the register file can read in the first half of a cycle
and write in the second half of a cycle, then the forwarding in CC5 is not
problematic. Otherwise, we would have to delay the execution of
the add instruction by one clock cycle (see Figure 5.9 for insertion of a stall).
Solution #2 - Code Re-Ordering: Since all Instructions 2 through 5 in the MIPS

code fragment require Register 2 as an operand, we do not have instructions in
that particular code fragment to put between Instruction 1 and Instruction 2.
However, let us assume that we have other instructions that (a) do not depend
on the results of Instructions 1-5, and (b) themselves induce no dependencies in
Instructions 1-5 (e.g., by writing to register 1, 2, 3, 5, or 6). In that case, we
could insert two instructions between Instructions 1 and 2, if register read and
write could occur concurrently. Otherwise, we would have to insert three such
instructions. The latter case is illustrated in the following figure, where the
inserted instructions and their pipeline actions are colored dark green.
Figure 5.8. Example of code reordering to solve data hazards in a sequence of

MIPS instructions [Pat98,MK98].
Solution #3 - Stalls: Suppose that we had no instructions to insert between

Instructions 1 and 2. For example, there might be data dependencies arising
from the inserted instructions that would themselves have to be repaired.
Alternatively, the program execution order (functional dependencies) might not
permit the reordering of code. In such cases, we have to insert stalls, also
called bubbles, which are no-op instructions that merely delay the pipeline
execution until the dependencies are no longer problematic with respect to
pipeline timing. This is illustrated in Figure 5.9 by inserting three stalls
between Instructions 1 and 2.
Figure 5.9. Example of stall insertion to solve data hazards in a sequence of

MIPS instructions [Pat98,MK98].
As mentioned previously, the insertion of stalls is the least desireable technique

because it delays the execution of an instruction without accomplishing any
useful work (in contrast to code re-ordering).
5.3.4. Structural Hazards
Definition. A structural hazard occurs when there is insufficient hardware to

support a computation in a given pipeline segment.
For example, consider the data dependency between the first and fourth
instructions (sub and add) of the example in Section 5.3.3. Here, a register file
write and a register file read are scheduled in CC5. This can be resolved by (a)
duplicating hardware, or (b) modifying the existing hardware to support
concurrent operations. If we duplicated the register file, then we could perform
concurrent read and write operations, but there would be a consistency
problem. That is, at a given clock cycle, registers in one register file could have
different values than the corresponding registers in the other register file. This
inconsistency is clearly unacceptable if accurate computation is to be
maintained.
Instead, we can modify the register file so that it (1) performs register write on
the first half of the clock cycle and (2) performs register read on the second half
of the clock cycle. In earlier hardware, designers sometimes inserted a delay
between write and read that was very small in relation to the clock cycle time,
in order to ensure convergence of the register file write.
Other structural hazards could occur during the branch instruction, if there were
not two ALUs in the EX segment of the pipeline. That is, with only one ALU,
we would have to simultaneously compute the BTA and determine (via
subtraction) whether or not the branch condition was fulfilled. This would not
be possible without two concurrent adders in the ALU, which is what we
currently have in our MIPS pipeline design shown in Figure 5.4.
A further structural hazard could occur if we only used one memory for both
instructions and data. For example, in Figure 5.7, suppose the sub instruction
was instead a sw instruction. Then, we would be writing to data memory in CC4
for Instruction #1 and reading from instruction memory in CC4 for Instruction
#4. Clearly, if there was only one memory there would be a conflict.
Similar to the problem with the concurrent reads and writes on the register file,
there are two ways to solve this dilemma. First, we can design a special-
purpose memory module that permits writing (reading) on the first (resp.
second) half of the clock cycle, as we said could be done with the register file.
However, this requires special (expensive) hardware. Second, we can use two
fast caches, one for instructions, and one for data, that access a large, slower
main memory in which instructions and data are both stored. The latter method
is used in practice because caches and main memory already exist, and the
memory management hardware for these types of components also exists.
Thus, we can use off-the-shelf hardware to solve a problem that would
otherwise require special-purpose development of expensive hardware.
Although this might not be as much fun as developing new hardware, it is more
cost-effective, which matters when one is designing and producing computers
for profit.
5.3.5. Control (Branch) Hazards
Control hazards are the most difficult types of hazards arising from normal
operation of a program. In the next section, we will see that exceptions (e.g.,
overflow) can play particularly interesting types of havoc with smooth pipeline
execution.
The most common type of control hazard is the branch instruction, which has
two alternative results: (1) jump to the branch target address if the instruction
succeeds, or (2) execute the instruction after the branch (at PC+4 of instruction
memory) if the branch fails.
The problem with the branch instruction is that we usually do not know which
result will occur (i.e., whether or not the branch will be taken) until the branch
condition is computed. Often, the branch condition depends on the result of the
preceding instruction, so we cannot precompute the branch condition to find
out whether or not the branch will be taken.
The following four strategies are employed in resolving control dependencies

due to branch instructions.
5.3.5.1. Assume Branch Not Taken. As we saw previously, we can insert

stalls until we find out whether or not the branch is taken. However, this slows
pipeline execution unacceptably. A common alternative to stalling is to
continue execution of the instruction stream as though the branch was not
taken. The intervening instructions between the branch and its target are then
executed. If the branch is not taken, this is not a harmful or disruptive
technique. However, if the branch is taken, then we must discard the results of
the instructions executed after the branch statement. This is done by flushing
the IF, ID, and EX stages of the pipeline for the discarded instructions.
Execution continues uninterrupted after the branch target.
The cost of this technique is approximately equal to the cost of discarding

instructions. For example, if branches are not taken 50 percent of the time, and
the cost of discarding results is negligible, then this technique reduces by 50
percent the cost of control hazards.
5.3.5.2. Reducing Branch Delay. In the MIPS pipeline architecture shown

schematically in Figure 5.4, we currently assume that the branch condition is
evaluated in Stage 3 of the pipeline (EX). If we move the branch evaluation up
one stage, and put special circuitry in the ID (Decode, Stage #2), then we can
evaluate the branch condition for the beq instruction. This would allow us to
take the branch in EX instead of MEM, since we would have ready for Stage 3
(EX) the Zero output of the comparator that would normally be computed in
EX. The hardware needed to compute equality is merely a series of
parallel xnor gates and-ed together, then inverted.
Exercise: Determine how this type of circuitry could be configured.
The advantage of this technique is that only one instruction needs to be

discarded rather than two, as in the previous section. This reduces hardware
cost and time required to flush the pipeline, since only the IF and ID stages
would need to be flushed, instead of the three stages in the preceding example.
5.3.5.3. Dynamic Branch Prediction. It would be useful to be able to predict

whether or not a majority of the branches are taken or not taken. This can be
done in software, using intelligent compilers, and can also be done at runtime.
We concentrate on the software-intensive techniques first, since they are less
expensive to implement (being closer to the compiler, which is easier to modify
than the hardware).
The most advantageous situation is one where the branch condition does not
depend on instructions immemdiately preceding it, as shown in the following
code fragment:
add $5, $5, $6 # One of the registers for beq comparison is
modified
sub $4, $3, $6 # Nothing important to the branch here
and $7, $8, $6 # Nothing important to the branch here
and $9, $6, $6 # Nothing important to the branch here
beq $5, $6, target
Here, the branch compares Registers 5 and 6, which are last modified in
the add instruction. We can therefore precompute the branch condition as sub r
$5, $6, where r denotes a destination register. If r = 0, then we know the branch
will be taken, and the runtime module (pipeline loader) can schedule the jump
to the branch target address with full confidence that the branch will be taken.
Another approach is to keep a history of branch statements, and to record what

addresses these statements branch. Since the vast majority of branches are used
as tests of loop indices, then we know that the branch will almost always jump
to the loopback point. If the branch fails, then we know the loop is finished, and
this happens only once per loop. Since most loops are designed to have many
iterations, branch failure occurs less frequently in loops than does taking the
branch.
Thus, it makes good sense to assume that a branch will jump to the place that it
jumped to before. However, in dense decision structures (e.g., nested or
cascaded if statements), this situation does not always occur. In such cases, one
might not be able to tell from the preceding branch whether or not the
branching behavior will be repeated. It is then reasonable to use a multi-branch
lookahead.
Reading Assigment: Study the discussion of multi-bit branch prediction

schemes given on pp. 501-502 of the textbook.
Another clever technique of making branches more efficient is the branch

delay slot. We previously discussed delayed branches when we overviewed the
multicycle datapath implementation of the beq instruction. In summary, the
concept of efficient branching has two parts. First, the branch target address is
computed in the ID stage of the pipeline, to determine as early as possible the
instruction to fetch if the branch succeeds. Since this is done in the second
stage of the pipeline, there is an instruction I following this (in the first or IF
stage). After I moves to the ID stage, then the branch target (pointed to by
either PC+4 or the BTA) is loaded into the IF stage.
It is this instruction (I) that is called the branch delay slot (BDS). In the BDS
can be safely placed an instruction that does not have data dependencies with
respect to (a) the branch condition, (b) the instruction following the branch
condition, or (c) the branch target. This ensures that, when the instruction J is
executed (J is the instruction to which control is transferred after the branch
condition is evaluated, whether J is pointed to by PC+4 or BTA), then the
instruction I will have been executed previously, and the pipe will not have a
stall where I would have been. As a result, the pipe will remain full throughout
the branch evaluation and execution, unless an exception occurs.
5.3.6. Exceptions as Hazards
Hardware and software must work together in any architecture, especially in a

pipeline processor. Here, the ISA and processor control must be designed so
that the following steps occur when an exception is detected:
1. Hardware detects an exception (e.g., overflow in the ALU) and stops the
offending instruction at the EX stage.
2. Pipeline loader and scheduler allow all prior instructions (e.g., those
already in the pipeline in MEM and WB) to complete.
3. All instructions that are present in the pipeline after the exception is
detected are flushed from the pipeline.
4. The address of the offending instruction (usually the address in main
memory) is saved in the EPC register, and a code describing the
exception is saved in the Cause register.
5. Hardware control branches to the exception handling routine (part of the
operating system).
6. The exception handler performs one of three actions: (1) notify the user
of the exception (e.g., divide-by-zero or arithmetic-overflow) then
terminate the program; (2) try to correct or mitigate the exception then
restart the offending instruction; or (3) if the exception is a benign
interrupt (e.g., an I/O request), then save the program/pipeline state,
service the interrupt request, then restart the program at the instruction
pointed to by EPC + 4.
In any case, the pipeline is flushed as described.
In general, we can say that, if a pipeline has N segments, and the EX stage is at
segment 1 < i < N, then two observations are key to the prediction of pipeline
performance:
o Flushing negates the processing of the (i-1) instructions following the

offending instruction. These must be reloaded into the pipe, at the cost of
i cycles (one cycle to flush, i-1 cycles to reload the i-1
instructions after the exception is processed).
o Completing the N-i instructions that were loaded into the pipeline prior
to the offending instruction takes N-i clock cycles, which are executed
(a) prior to, or (b) concurrently with, the reloading of the instructions i-1
that followed the i-th instruction (in the EX stage).
It is readily seen that the total number of wasted cycles equals (i-1) + (N-i) = N
- 1, which is precisely the number of cycles that it takes to set up or reload the
pipeline.
The proliferation of unproductive cycles can be mitigated by the following

technique:
9. Freeze the pipeline state as soon as an exception is detected.

10. Process the exception via the exception handler, and decide whether or
not to halt or restart the pipeline.
11. If the pipeline is restarted, reload the (i-1) instructions following the
offending instruction, concurrently with completing execution of the (N-
i) instructions that were being processed prior to the offending
instruction.
If Step 3 can be performed as stated, then the best-case penalty is only one
cycle, plus the time incurred by executing the exception handler. If the entire
pipeline needs to be flushed and restarted, then the worst-case penalty is N
cycles incurred by flushing the pipe, then reloading the pipeline after the
instructions preceding the offending instruction have been executed. If the
offending instruction must be restarted, then a maximum of i cycles are lost
(one cycle for flush, plus (i-1) cycles to restart the instructions in the pipe
following the offending instruction).
In the next section, we collect the concepts about pipeline performance that we
have been discussing, and show how to compute the CPI for a pipeline
processor under constraint of stalls, structural hazards, branch penalties, and
exception penalties.
Reservation Table: Displays the time space flow of data through the pipeline for one
function evaluation.
Time
1 2 3 4 5 6 7 8
X X X
X X
X X X
S1
Stage S2
S3
Reservation function for a function x
Latency:The number of time units (clock cycles) between two initiations of a pipeline is the latency
between them. Latency values must be non-negative integers.
Collision:When two or more initiations are done at same pipeline stage at the same time will cause a
collision. A collision implies resource conflicts between two initiations in the pipeline, so it should be
avoided.
Forbidden Latency : Latencies that cause collisions are called forbidden latencies. (E.g. in above
reservation table 2, 4, 5 and 7 are forbidden latencies).
Permissible Latency : Latencies that do not cause any collision are called permissible latencies.
(E.g. in above reservation table 1, 3 and 6 are permissible latencies).
Latency Sequence and Latency Cycle: A Latency Sequence is a sequence of permissible non-
forbidden latencies between successive task initiations.
A Latency cycle is a latency sequence which repeats the same subsequence (cycle)
indefinitely.
The Average Latency of a latency cycle is obtained by dividing the sum of all latencies by the
number of latencies along the cycle.
The latency cycle (1, 8) has an average latency of (1+8)/2=4.5.
A Constant Cycle is a latency cycle which contains only one latency value. (E.g. Cycles (3) and
(6) both are constant cycle).
Collision Vector or Initial Collision Vector : The combined set of permissible and forbidden
latencies can be easily displayed by a collision vector, which is an m-bit (m<=n-1 in a n column
reservation table) binary vector C=(CmCm-1….C2C1). The value of Ci=1 if latency I causes a collision
and Ci=0 if latency i is permissible. (E.g. Cx= (1011010)).
State Diagram : Specifies the permissible state transitions among successive initiations based
on the collision vector.
The minimum latency edges in the state diagram are marked with asterisk.
Simple Cycle, Greedy Cycle and MAL: A Simple Cycle is a latency cycle in which each state
appears only once. In above state diagram only (3), (6), (8), (1, 8), (3, 8), and (6, 8) are simple
cycles. The cycle(1, 8, 6, 8) is not simple as it travels twice through state (1011010).
A Greedy Cycle is a simple cycle whose edges are all made with minimum latencies from their
respective starting states. The cycle (1, 8) and (3) are greedy cycles.
MAL (Minimum Average Latency) is the minimum average latency obtained from the greedy
cycle. In greedy cycles (1, 8) and (3), the cycle (3) leads to MAL value 3.
State Diagram
1) Start with the ICV (Initial collision vector).
2) For each unprocessed state, For each bit i in the CVi which is 0, do the following:
i) Shift CVi right by i bits
ii) i find from rightmost bits.
iii) Append zeros to left
iv) Logically OR with ICV
v) Performed shift operation on last pass of maximum permissible latency either this place 0
present or not.
vi) If step(d) results in a new state then form a new node for this state and join it with node
of CVi by an arc with a marking i.
vii) This shifting process needs to continue until no more new states can be generated.
Greedy Cycle: A simple cycle is a greedy cycle if each latency contained in a cycle is the minimal
latency(outgoing arc) from a state in the cycle.Complex Cycle: consists of more than one simple
cycle in it. 
1.
The state with all zeros has a self-loop which corresponds to empty pipeline and it is possible to
wait for indefinite number of latency cycles of the form (7),(8), (9),(10) etc. Method of finding
Cycle
2. 18. Simple cycles & The Greedy cycle is (1,3,3). Minimum Average Latency(MAL):
Scheduling strategy that achieve maximum Throughput without collision. In the above
example, the cycle that offers MAL is (1, 3, 3) (MAL = (1+3+3)/3 =2.333)The simple cycles
are (3),(5) ,(1,3,3),(4,3) and (4) Greedy cycles

Techopedia Explains: Amdahl's Law

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

Techopedia Explains: Amdahl's Law

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Techopedia Explains: Amdahl's Law

Uploaded by

Copyright:

Available Formats

Shared memory parallel computers vary widely, but generally have in common the ability for all

processors to access all memory as global address space.

Amdahl’s law is a formula used to find the maximum improvement improvement

This term is also known as Amdahl’s argument.

[WEBINAR] How to Optimize Use of the Hybrid Cloud

Techopedia explains Amdahl's Law

Shared Memory Pros and Cons

Pipeline Control and Hazards

Reading Assignments and Exercises

5.3.1. Pipeline Control Issues and Hardware

Here, the following stages perform work as specified:

B tries to write a register before A has read it.

B tries to write an operand before A has written it.

5.3.3. Data Hazards

We typically remedy this problem in one of three ways:

 Forwarding: In order to resolve a dependency, one adds special circuitry to the

The following example is illustrative.

Example. Suppose we have the following sequence of instructions:

whose pipeline scheduling diagram is shown in Figure 5.7.

Figure 5.7. Example of data hazards in a sequence of MIPS instructions, where

Solution #1 - Forwarding: The result generated by the sub instruction can

Solution #2 - Code Re-Ordering: Since all Instructions 2 through 5 in the MIPS

Figure 5.8. Example of code reordering to solve data hazards in a sequence of

Solution #3 - Stalls: Suppose that we had no instructions to insert between

Figure 5.9. Example of stall insertion to solve data hazards in a sequence of

As mentioned previously, the insertion of stalls is the least desireable technique

5.3.4. Structural Hazards

Definition. A structural hazard occurs when there is insufficient hardware to

5.3.5. Control (Branch) Hazards

The following four strategies are employed in resolving control dependencies

5.3.5.1. Assume Branch Not Taken. As we saw previously, we can insert

The cost of this technique is approximately equal to the cost of discarding

5.3.5.2. Reducing Branch Delay. In the MIPS pipeline architecture shown

The advantage of this technique is that only one instruction needs to be

5.3.5.3. Dynamic Branch Prediction. It would be useful to be able to predict

Another approach is to keep a history of branch statements, and to record what

Reading Assigment: Study the discussion of multi-bit branch prediction

Another clever technique of making branches more efficient is the branch

5.3.6. Exceptions as Hazards

Hardware and software must work together in any architecture, especially in a

In any case, the pipeline is flushed as described.

o Flushing negates the processing of the (i-1) instructions following the

The proliferation of unproductive cycles can be mitigated by the following

9. Freeze the pipeline state as soon as an exception is detected.

The latency cycle (1, 8) has an average latency of (1+8)/2=4.5.

You might also like