Pipelining and Parallelism


Computer Architecture

UEC509-Part-3
Dr. Debabrata Ghosh
Assistant Professor, ECED
Thapar University
Pipelining
• Overlapping the execution of multiple instructions
• During each clock cycle a new instruction is initiated

Inst No                  Pipeline stages

Clock cycles    1    2    3    4    5    6    7    8    9
i               IF   ID   EX   MEM  WB
i+1                  IF   ID   EX   MEM  WB
i+2                       IF   ID   EX   MEM  WB
i+3                            IF   ID   EX   MEM  WB
i+4                                 IF   ID   EX   MEM  WB

• During each clock cycle, different parts of different instructions are executed. Each part is
called a pipeline stage or pipe stage
• Machine cycle: the time to complete a single pipeline stage. It is set by the slowest pipeline stage
• Designer goal: balance the length of each pipeline stage
Pipelining

• Ideal speedup = number of pipeline stages


• DO NOT use same datapath resource for two different operations during the same clock
cycle
• Pipeline reduces average execution time per instruction

Decrease in CPI
• The unpipelined machine takes multiple CCs per instruction; pipelining brings the average CPI toward one
DLX datapath in pipeline format

• Only major functional units are shown: using different resources in different CCs introduces few conflicts
DLX datapath in pipeline format
3 important observations:
1) Two separate memories (IM and DM) are used to avoid a resource conflict
between the IF of one instruction and the MEM of another in the same CC
2) The same Reg (register subsystem) is used in the ID and WB stages.
Will there be a resource conflict between the ID of one instruction and the
WB of another? YES. How to avoid it? In the ID stage Reg is only read,
whereas in the WB stage Reg is only written. Split the CC into two halves:
one half for the write and the other half for the read
DLX datapath in pipeline format
3) PC is not shown. In unpipelined datapath, PC is written twice: once in IF stage and
once in MEM stage (for branch or jump). In pipelined datapath, to avoid two different
instructions writing two different values in PC, writing PC will be done only once, in IF
stage.
Pipeline registers/latches

• Each pipeline stage is separated from and connected to the following stage using pipeline
registers
• At end of each CC, results of one pipeline
stage are stored in pipeline registers and
used by the next pipeline stage
• Buffer intermediate values to be used by
nonadjacent pipeline stages
• Has same role as temporary registers: NPC,
A, B, I, ALUout, LMD, Cond
• Carry both data and control from one
pipeline stage to another
• Result stored until it’s no longer needed
• Labelled as names of the stages they
connect and separate
Instruction flow within pipelined DLX:
LW R3, 50(R1)
Hardware stage 1 (IF):
IF/ID.IR ← Mem[PC]
IF/ID.NPC ← PC + 4

Hardware stage 2 (ID): Read IF/ID register
ID/EX.A ← Regs[IF/ID.IR6…10]
ID/EX.B ← Regs[IF/ID.IR11…15]
ID/EX.I ← (IF/ID.IR16)^16 ## IF/ID.IR16…31
ID/EX.NPC ← IF/ID.NPC (not going to be used anymore for this instruction)
ID/EX.IR ← IF/ID.IR

Hardware stage 3 (EX): Read ID/EX register
EX/MEM.ALUout ← ID/EX.A + ID/EX.I
EX/MEM.IR ← ID/EX.IR

Hardware stage 4 (MEM): Read EX/MEM register
MEM/WB.LMD ← Mem[EX/MEM.ALUout]
MEM/WB.IR ← EX/MEM.IR

Hardware stage 5 (WB): Read MEM/WB register
Regs[MEM/WB.IR11…15] ← MEM/WB.LMD
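The latch-to-latch flow above can be sketched as a toy Python model. The register and memory contents below are invented example values, and the dictionaries merely stand in for the real pipeline registers:

```python
# Toy model of LW R3, 50(R1) moving through the DLX pipeline latches.
# All register/memory values here are made-up examples.
regs = {1: 1000, 3: 0}                     # R1 = 1000 (base), R3 = destination
mem = {1050: 42}                           # word stored at address 1000 + 50
pc = 0

# Stage 1 (IF): fetch the instruction and compute NPC
IF_ID = {"IR": ("LW", 3, 1, 50), "NPC": pc + 4}

# Stage 2 (ID): read the base register, pass the immediate along
op, rd, rs, imm = IF_ID["IR"]
ID_EX = {"A": regs[rs], "I": imm, "NPC": IF_ID["NPC"], "IR": IF_ID["IR"]}

# Stage 3 (EX): effective address = A + I
EX_MEM = {"ALUout": ID_EX["A"] + ID_EX["I"], "IR": ID_EX["IR"]}

# Stage 4 (MEM): read the data memory
MEM_WB = {"LMD": mem[EX_MEM["ALUout"]], "IR": EX_MEM["IR"]}

# Stage 5 (WB): write LMD into the destination register named in the IR
regs[MEM_WB["IR"][1]] = MEM_WB["LMD"]
print(regs[3])  # 42
```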
Performance using pipelining
• Pipelining increases the execution time of an individual instruction: due to overhead
(increased hardware, delay from pipeline registers etc.)
• However, pipelining increases CPU throughput (number of instructions executed per
unit time). The program runs faster.
Performance using pipelining
Assume an unpipelined machine has 10-ns clock cycles and uses four cycles for
ALU operations and branches and five cycles for memory operations. Assume that
the relative frequencies of these operations are 40%, 20%, and 40%, respectively.
Suppose that due to clock skew and setup, pipelining the machine adds 1 ns of
overhead to the clock. Ignoring any latency impact, how much speedup in the
instruction execution rate will we gain from a pipeline?
(Assume pipeline is full)
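A worked-solution sketch (my arithmetic, not given on the slide): the unpipelined machine averages 4.4 cycles of 10 ns each, while the full pipeline completes one instruction per 11 ns cycle.

```python
# Speedup for the example above: 40% ALU (4 CCs), 20% branch (4 CCs),
# 40% memory (5 CCs), 10 ns clock, 1 ns pipelining overhead.
avg_cycles = 0.4 * 4 + 0.2 * 4 + 0.4 * 5          # 4.4 cycles per instruction
avg_time_unpipelined = avg_cycles * 10             # 44 ns
avg_time_pipelined = 10 + 1                        # 11 ns per instruction (full pipeline)
speedup = avg_time_unpipelined / avg_time_pipelined
print(round(speedup, 2))  # 4.0
```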
Performance using pipelining
Assume that the times required for the five functional units, which operate in each
of the five cycles, are as follows: 10 ns, 8 ns, 10 ns, 10 ns, and 7 ns. Assume that
pipelining adds 1 ns of overhead. How much speedup in the instruction execution
rate will be gained from a pipeline?
(Assume pipeline is full)
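A solution sketch (my arithmetic): unpipelined, one instruction takes the sum of all stage times; pipelined, the clock is set by the slowest stage plus the overhead.

```python
stage_times = [10, 8, 10, 10, 7]        # ns, the five functional units
time_unpipelined = sum(stage_times)     # 45 ns per instruction
clock_pipelined = max(stage_times) + 1  # 11 ns: slowest stage + 1 ns overhead
speedup = time_unpipelined / clock_pipelined
print(round(speedup, 2))  # 4.09
```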
Performance using pipelining
Assume an unpipelined machine takes 100 ns to execute an instruction. Assume
that the time required for each of the five functional units, which operate in each
of the five cycles, are 20 ns. Determine speedup ratio of the pipeline for executing
1000 instructions.
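A solution sketch (my arithmetic): with the pipeline full, the first instruction takes 5 cycles and each of the remaining 999 completes one cycle later.

```python
n = 1000
time_unpipelined = 100 * n                          # 100 ns per instruction
clock_pipelined = 20                                # ns, all five stages equal
time_pipelined = (5 + (n - 1)) * clock_pipelined    # 20080 ns
speedup = time_unpipelined / time_pipelined
print(round(speedup, 2))  # 4.98
```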
Pipeline hazards
• 3 types of problems (hazards) limit pipeline effectiveness
1) Structural hazards: caused due to conflict of datapath resources. Two
instructions trying to access same resource in same clock cycle
2) Data hazards: caused when an instruction depends on the result of a
previous instruction which is not yet available
3) Control hazards: caused when pipelining branch or other instructions
which change PC
BEQZ R4, Addr
If Regs [R4] == 0
PC ← PC + 4 + signext (Addr)

• When hazard happens, it is necessary to delay (stall) the pipeline


Pipeline hazards
• If instruction n requires stall (due to hazard) in the pipeline
• Instructions initiated after the nth instruction are also stalled
• Instructions initiated before the nth instruction are continued normally

(Figure: pipeline diagram for a machine with a single memory unit — instructions issued
before the stalled instruction are unaffected by the stall; instructions issued after it are
affected)
Performance of pipeline with stalls
• Speedup using pipeline (ideal case, no stall) = number of pipeline
stages i.e. pipeline depth
• Speedup using pipeline = (average instruction time unpipelined) / (average instruction time pipelined)
= (CPI unpipelined x CC time unpipelined) / (CPI pipelined x CC time pipelined)
• Pipeline reduces average execution time per instruction
• Pipeline decreases CPI (assume CC time unpipelined = CC time pipelined, ignore the CC time overhead due
to pipelining)
• Speedup using pipeline = CPI unpipelined / CPI pipelined
Performance of pipeline with stalls
• Speedup using pipeline = CPI unpipelined / CPI pipelined
• CPI pipelined ideal = 1 (i.e., no stall)
• CPI pipelined with stall = CPI pipelined ideal + pipeline stall CCs per instruction
(here the stall is 1 CC)
• Speedup using pipeline = CPI unpipelined / (1 + pipeline stall CCs per instruction)
• In the simplest case, each instruction takes exactly the same number of CCs, which is equal to the number of pipe stages or
pipeline depth
• Speedup using pipeline = pipeline depth / (1 + pipeline stall CCs per instruction) = pipeline depth (if no stall)
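The stall formula can be captured in a few lines (a sketch under the simplest-case assumption above, where CPI unpipelined equals the pipeline depth):

```python
def pipeline_speedup(depth, stall_ccs_per_instr):
    """Speedup = pipeline depth / (1 + pipeline stall CCs per instruction)."""
    return depth / (1 + stall_ccs_per_instr)

print(pipeline_speedup(5, 0))  # 5.0: no stalls, speedup = pipeline depth
print(pipeline_speedup(5, 1))  # 2.5: one stall CC per instruction halves it
```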
Structural Hazards
• Caused due to conflict of datapath resources: 2 instructions trying to use the same resource at the same
time
• Example: machine with a single memory for both instruction fetching and data transfer
• An instruction containing a data memory reference (e.g. load or store) will conflict with the
instruction fetch of a later instruction, since both use the same memory
• To resolve, stall the pipeline for one CC when a data memory reference instruction occurs
Find stalls due to structural hazards

(Pipeline diagram with two load instructions sharing a single memory)

How many stalls?

Structural Hazards
• Introducing stalls degrades performance
• Why would a designer allow structural hazards?
• To reduce cost: duplicating all datapath resources costs a lot, and is not worth the cost if
structural hazards are few
• To reduce latency: pipelining all the functional units to avoid hazards would introduce too
much latency from the registers
Structural Hazards
Suppose data-reference instructions constitute 40% of all instructions. Ideal CPI of
pipelined machine is 1. Suppose we have two pipelined machines:
1) machine A without structural hazards
2) machine B with structural hazards.
Clock rate B = 1.05 Clock rate A.

Which machine is faster?
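A solution sketch (my arithmetic): machine B pays one stall cycle on the 40% of instructions that reference data, so its CPI is 1.4, which its 5% faster clock cannot recover.

```python
cpi_a = 1.0                         # machine A: no structural hazards
cpi_b = 1.0 + 0.4 * 1               # machine B: 1 stall CC on 40% of instructions
cc_a = 1.0                          # normalized clock cycle time of A
cc_b = cc_a / 1.05                  # B's clock rate is 1.05x A's
time_a = cpi_a * cc_a
time_b = cpi_b * cc_b
print(round(time_b / time_a, 2))    # 1.33: machine A is ~1.33x faster
```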


Performance using pipelining
Consider a nonpipelined machine with 6 execution stages of lengths 50 ns, 50 ns, 60
ns, 60 ns, 50 ns, and 50 ns.
Find the instruction latency on this machine. How much time does it take to execute
100 instructions?

Suppose we introduce pipelining on this machine. Assume that when introducing


pipelining, the clock skew adds 5ns of overhead to each execution stage.
What is the instruction latency on the pipelined machine?
How much time does it take to execute 100 instructions?

What is the speedup obtained from pipelining?
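A solution sketch (my arithmetic, assuming the pipeline stays full across the 100 instructions):

```python
stages = [50, 50, 60, 60, 50, 50]        # ns
n = 100
latency_unpipelined = sum(stages)        # 320 ns per instruction
time_unpipelined = latency_unpipelined * n                # 32000 ns
clock_pipelined = max(stages) + 5        # 65 ns: slowest stage + 5 ns skew
latency_pipelined = len(stages) * clock_pipelined         # 390 ns
time_pipelined = (len(stages) + n - 1) * clock_pipelined  # 6825 ns
print(latency_pipelined, time_pipelined, round(time_unpipelined / time_pipelined, 2))
# 390 6825 4.69
```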


DLX programming
• What's wrong with the following lines of code:

• ADDI R34, R5, R0


• SUB 0(R26), F4, R3
• J R11
• MULT R9, R8, R7
• LD F15, 0(R1)
DLX programming
1. Multiply two data words lying in consecutive memory locations and store the result at a
consecutive location in memory.
2. Tests to see if a value in R19 is equal to zero, and branches to a line labeled “loop” if it
is not
3. Jumps to an instruction which is pointed to by R16, and saves the address of the next
instruction (PC+4) in R31.
4. Multiplies two integers residing in R1 and R2 and stores the result at location 100
5. Load two double-precision floats stored in consecutive memory locations, multiply
them, and store the result in consecutive memory locations
6. Find the factorial of the number 10 and store the result at location 2000.
DLX programming
• Tests to see if a value in R19 is equal to zero, and branches to a line labeled "loop" if it is not
• Jumps to an instruction which is pointed to by R16, and saves the address of the next instruction in R31
• Multiplies two integers residing in R1 and R2 and stores the result.
Topics up to this point will be in the MST
Data Hazards
• Pipelining:
• Overlapping execution of multiple instructions
• During each clock cycle, a new instruction is initiated
• Overlapping execution of multiple instructions by changing their relative timing
• Introduces data hazards
• Occur when the pipeline changes the order of read/write accesses so that the order differs
from what is seen by sequential execution of the instructions
Data Hazards
Observations:
• All instructions after the ADD instruction use the result of the ADD
instruction (written in R1)
• ADD instruction writes its result in R1 in the WB stage (CC5)
• SUB instruction reads R1 in ID stage (CC3): data hazard, reads and
uses wrong value of R1
• AND instruction reads R1 in ID stage (CC4): data hazard, reads and
uses wrong value of R1
• OR instruction reads R1 in ID stage (CC5). No data hazard:
• ADD writes into the register file in the first half of CC5
• OR reads from the register file in the second half of CC5
• XOR instruction reads R1 in ID stage (CC6). No data hazard
Data Hazards
• How many stalls needed to avoid data hazards?

• Pipeline stall clock cycles = 2 in order to avoid data hazards


• Forwarding/ Bypassing to eliminate stalls: hardware technique
Data Hazards: Forwarding

Key insight in forwarding:


• R1 is not really needed by SUB instruction until ADD instruction actually produces it
• ADD instruction produces result (R2+R3) in EX/MEM register
• SUB instruction needs result of ADD instruction in ALU input latch
• Supply the result to the SUB instruction when it needs it: done by forwarding
• No need for stall
Data Hazards: Forwarding
Forward from EX/MEM to ALU input

Forward from MEM/WB to ALU input


Forward through register file

How does forwarding work?
• If the forwarding hardware detects that the previous ALU instruction has
written the register corresponding to a source of the current instruction,
control logic selects the forwarded result as the ALU input rather than
the value read from the register file
• Results are forwarded not only from the immediately previous instruction,
but possibly from an instruction started 3 clock cycles earlier
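The mux-control decision described above can be sketched as follows. This is a simplified software model, not the actual DLX control logic, and the field names are invented:

```python
# Forwarding-mux sketch: compare destination registers held in the EX/MEM
# and MEM/WB latches against the current instruction's source register.
def select_alu_input(src_reg, reg_file, ex_mem, mem_wb):
    """Return the value to feed the ALU for source register src_reg."""
    if ex_mem["writes_reg"] == src_reg:   # result from 1 instruction back
        return ex_mem["alu_out"]
    if mem_wb["writes_reg"] == src_reg:   # result from 2 instructions back
        return mem_wb["value"]
    return reg_file[src_reg]              # no hazard: normal register read

regs = {1: 111, 2: 5, 3: 7}
ex_mem = {"writes_reg": 1, "alu_out": 12}      # ADD R1, ... just computed 12
mem_wb = {"writes_reg": None, "value": None}
print(select_alu_input(1, regs, ex_mem, mem_wb))  # 12, not the stale 111
```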
Data Hazards classification
Based on order of read/write accesses in the instruction: RAW, WAW, WAR
• Convention: name the hazard by the ordering that must be preserved by the pipeline
• Consider two instructions: i and j, with i occurring before j
Read After Write (RAW):
• Cause of hazard: Read before write
• j tries to read an operand before it is written by i: j incorrectly reads old value of the operand
• Most common data hazard
• Overcome by forwarding

Write After Write (WAW):


• Cause of hazard: Write before write
• j tries to write an operand before it is written by i: leaves the value written by i rather than the value written by j
• Only present if pipeline writes in more than one pipe stage

Instruction i   IF   ID   EX   MEM1  MEM2  WB
Instruction j        IF   ID   EX    WB
Data Hazards classification
Write After Write (WAW):
• Not possible in DLX unless following changes are made: 1) Move WB of an ALU instruction into MEM stage, 2)
Data memory access (MEM) takes two pipe stages
R1 is written here by LW

LW R1, 0(R2) IF ID EX MEM1 MEM2 WB


ADD R1, R2, R3 IF ID EX WB

R1 is written here by ADD


Write After Read (WAR):
• Cause of hazard: Write before read
• j tries to write an operand before it is read by i: i incorrectly reads new value (value produced by instruction j)
• Not possible in DLX because all writes (WB) are after read (ID)
R2 is read here by SW (second half of CC5)

SW R2, 0(R1) IF ID EX MEM1 MEM2 WB


ADD R2, R3, R4 IF ID EX WB

R2 is written here by ADD (first half of CC5)


When stalls can’t be eliminated
• Not all potential data hazards can be handled by forwarding
                CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8
LW R1, …        IF   ID   EX   MEM  WB
SUB …, R1, …         IF   ID   EX   MEM  WB
AND …, R1, …              IF   ID   EX   MEM  WB
OR …, R1, …                    IF   ID   EX   MEM  WB
• LW instruction doesn’t have data ready until end of CC4
• SUB instruction needs data at the beginning of CC4 (forwarding is not going to work!!)
• AND instruction needs data at the beginning of CC5 (forward from MEM/WB register to
ALU input latch)
• OR instruction reads R1 in second half of CC5, R1 is written by LW in first half of CC5
When stalls can’t be eliminated
• Stall the pipeline to avoid data hazards
                CC1  CC2  CC3  CC4    CC5    CC6  CC7  CC8  CC9
LW R1, …        IF   ID   EX   MEM    WB
SUB …, R1, …         IF   ID   Stall  EX     MEM  WB
AND …, R1, …              IF   Stall  ID     EX   MEM  WB
OR …, R1, …                    Stall  IF     ID   EX   MEM  WB
• No forwarding necessary for the AND instruction, read R1 through register file
• Load interlock
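The load-interlock check can be sketched in the same spirit (a simplified model, not the real hardware; the field names are invented):

```python
# Load interlock sketch: stall when the instruction in ID needs the register
# that a load currently in EX will only produce at the end of MEM.
def needs_load_stall(ex_instr, id_instr):
    if ex_instr is None or ex_instr["op"] != "LW":
        return False
    return ex_instr["dest"] in id_instr["srcs"]

lw = {"op": "LW", "dest": 1}
sub = {"op": "SUB", "dest": 4, "srcs": [1, 5]}   # reads R1 -> must stall
orr = {"op": "OR", "dest": 8, "srcs": [6, 7]}    # independent -> no stall
print(needs_load_stall(lw, sub))  # True: insert one stall cycle
print(needs_load_stall(lw, orr))  # False
```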
Pipeline scheduling
• Technique to reduce number of stalls in pipeline
• Rather than allowing pipeline to stall, compiler schedules pipeline by rearranging the instruction
sequence
• Use pipeline scheduling to avoid stalls for the following sequence of statements:
a = b + c; (a, b, c, d, e, f all are memory locations)
d = e – f;

• DLX assembly code for the sequence?

CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9 CC10 CC11 CC12 CC13 CC14
LW Rb, b IF ID EX MEM WB
LW Rc, c IF ID EX MEM WB
ADD Ra, Rb, Rc IF ID Stall EX MEM WB
SW Ra, a IF Stall ID EX MEM WB
LW Re, e Stall IF ID EX MEM WB
LW Rf, f IF ID EX MEM WB
SUB Rd, Re, Rf IF ID Stall EX MEM WB
SW Rd, d IF Stall ID EX MEM WB
Pipeline scheduling
• After pipeline scheduling:

CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9 CC10 CC11 CC12
LW Rb, b IF ID EX MEM WB
LW Rc, c IF ID EX MEM WB
LW Re, e IF ID EX MEM WB
ADD Ra, Rb, Rc IF ID EX MEM WB
LW Rf, f IF ID EX MEM WB
SW Ra, a IF ID EX MEM WB
SUB Rd, Re, Rf IF ID EX MEM WB
SW Rd, d IF ID EX MEM WB
Find stalls due to data hazards (Assume no forwarding/bypassing)

                 CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
(instruction 1)  IF   ID   EX   MEM  WB
(instruction 2)       IF   ID   EX   MEM  WB
(instruction 3)            IF   ID   EX   MEM  WB
(instruction 4)                 IF   ID   EX   MEM  WB
(instruction 5)                      IF   ID   EX   MEM  WB

Find stalls due to data hazards (Assume no forwarding/bypassing)

        CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9 CC10 CC11 CC12 CC13 CC14 CC15
        IF ID EX MEM WB
        IF Stall Stall ID EX MEM WB
        Stall Stall IF Stall Stall ID EX MEM WB
        Stall Stall IF ID EX MEM WB
        IF Stall Stall ID EX MEM WB

Notes from the figure: R1 is written in the 1st half of CC5 and read in the 2nd half of CC5;
R1 is written again in the 1st half of CC8 and read in the 2nd half of CC8; R2 is written in
the 1st half of CC12 and read in the 2nd half of CC12
Remove data hazards by forwarding

Without forwarding: stall of 4 CCs
CC1 CC2 CC3 CC4 CC5 CC6 CC7

ADD R1, R2, R3 IF ID EX MEM WB

LW R4, 0(R1) IF ID EX MEM WB

SW 12(R1), R4 IF ID EX MEM WB

No stalls!!!
Performance with data hazards
Assume 30% of the instructions are loads. Half the time, the instruction following a load
depends on the result of the load instruction.
If the hazard causes a single-cycle delay, how much faster is the ideal pipeline?

1) Load inst is followed by an inst which depends on the result of the load: CPI = 1 + 1 = 2
2) Load inst is followed by an inst which does not depend on the result of the load: CPI = 1
Average CPI for loads = 1.5 (50% each case)
30% of instructions are loads: CPI = 1.5; 70% are not loads: CPI = 1
Overall CPI = (0.3 x 1.5) + (0.7 x 1) = 1.15
Ideal CPI = 1, thus the ideal pipeline is 1.15 times faster
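The CPI arithmetic above can be checked directly:

```python
load_frac = 0.3
dep_frac = 0.5                                # half of loads feed the next instruction
cpi_load = dep_frac * 2 + (1 - dep_frac) * 1  # 1.5 average CPI for loads
cpi = load_frac * cpi_load + (1 - load_frac) * 1.0
print(round(cpi, 2))  # 1.15: the ideal pipeline is 1.15x faster
```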
