Pipelining and Parallelism
UEC509-Part-3
Dr. Debabrata Ghosh
Assistant Professor, ECED
Thapar University
Pipelining
• Overlapping the execution of multiple instructions
• During each clock cycle, a new instruction is initiated
• During each clock cycle, different parts of different instructions are executed; each part is
called a pipeline stage (or pipe stage)
• Machine cycle: the time to complete a single pipeline stage; determined by the slowest stage
• Designer goal: balance the length of the pipeline stages
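The "slowest stage sets the clock" point can be checked with a small sketch (the stage delays and 1-ns latch overhead below are illustrative, not from the lecture):

```python
# Sketch: the pipeline clock must fit the slowest stage (plus any
# per-stage latch overhead), so balancing stage lengths matters.

def machine_cycle(stage_delays_ns, latch_overhead_ns=0.0):
    """Pipeline clock period = slowest stage + per-stage latch overhead."""
    return max(stage_delays_ns) + latch_overhead_ns

balanced = machine_cycle([10, 10, 10, 10, 10])      # 10 ns
unbalanced = machine_cycle([10, 8, 10, 10, 7])      # still 10 ns: slowest stage dominates
with_overhead = machine_cycle([10, 8, 10, 10, 7], 1)  # 11 ns
print(balanced, unbalanced, with_overhead)
```

Shortening only the fast stages does not help; the 10-ns stage still sets the cycle time.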
Pipelining
Decrease in CPI
• An unpipelined machine takes multiple CCs per instruction; a full pipeline ideally completes one instruction per CC
DLX datapath in pipeline format
• Only the major functional units are shown in each CC; overlapping them introduces a few resource conflicts
DLX datapath in pipeline format
3 important observations:
1) Two separate memories (IM and DM) are used to avoid a resource conflict
between the IF of one instruction and the MEM of another in the same CC
2) The same Reg (register subsystem) is used in the ID and WB stages.
Will there be a resource conflict between the ID of one instruction and the WB
of another? YES. How to avoid it?
In the ID stage Reg is only read, whereas in the WB stage Reg is only written. Split
the CC into two halves: one half for the write and the other half for the read
DLX datapath in pipeline format
3) The PC is not shown. In the unpipelined datapath, the PC is written twice: once in the IF stage and
once in the MEM stage (for a branch or jump). In the pipelined datapath, to avoid two different
instructions writing two different values to the PC, the PC is written only once, in the IF
stage.
Pipeline registers/latches
ID/EX.NPC ← IF/ID.NPC (not going to be used anymore for this instruction)
ID/EX.IR ← IF/ID.IR
EX/MEM.IR ← ID/EX.IR
MEM/WB.IR ← EX/MEM.IR
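The latch transfers above can be modeled as copies made at each clock edge, oldest latch first so nothing is overwritten before it moves on (a minimal sketch; the dictionary and instruction names are illustrative):

```python
# Sketch of the IR field rippling through the pipeline latches each
# clock edge: MEM/WB gets EX/MEM's value, EX/MEM gets ID/EX's, and so on.

def clock_edge(latches, new_ir):
    """Advance every IR one latch (oldest first), then fetch new_ir into IF/ID."""
    latches["MEM_WB.IR"] = latches["EX_MEM.IR"]
    latches["EX_MEM.IR"] = latches["ID_EX.IR"]
    latches["ID_EX.IR"] = latches["IF_ID.IR"]
    latches["IF_ID.IR"] = new_ir
    return latches

latches = {"IF_ID.IR": None, "ID_EX.IR": None, "EX_MEM.IR": None, "MEM_WB.IR": None}
for instr in ["LW", "ADD", "SW", "SUB"]:
    clock_edge(latches, instr)
print(latches["MEM_WB.IR"])  # LW: the first instruction reaches MEM/WB after 4 edges
```

The copy order matters: updating IF/ID first would let one instruction overwrite the whole pipeline.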
Performance using pipelining
• Pipelining increases the execution time of an individual instruction, due to overhead
(increased hardware, delay from pipeline registers, etc.)
• However, pipelining increases CPU throughput (the number of instructions executed per
unit time), so programs run faster.
Performance using pipelining
Assume an unpipelined machine has 10-ns clock cycles and uses four cycles for
ALU operations and branches and five cycles for memory operations. Assume that
the relative frequencies of these operations are 40%, 20%, and 40%, respectively.
Suppose that due to clock skew and setup, pipelining the machine adds 1 ns of
overhead to the clock. Ignoring any latency impact, how much speedup in the
instruction execution rate will we gain from a pipeline?
(Assume pipeline is full)
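The numbers above can be worked through directly: the average unpipelined instruction time is the frequency-weighted cycle count times the 10-ns clock, and the pipelined machine issues one instruction per (10 + 1)-ns clock once full.

```python
# Worked check of the speedup example above.
freqs_cycles = [(0.4, 4), (0.2, 4), (0.4, 5)]  # ALU, branch, memory: (freq, cycles)
unpipelined_ns = 10 * sum(f * c for f, c in freqs_cycles)  # 44 ns average
pipelined_ns = 10 + 1                                      # clock + 1 ns skew/setup overhead
speedup = unpipelined_ns / pipelined_ns
print(speedup)  # 4.0
```

So pipelining gives a 4x speedup in the instruction execution rate despite the 1-ns overhead.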
Performance using pipelining
Assume that the times required for the five functional units, which operate in each
of the five cycles, are as follows: 10 ns, 8 ns, 10 ns, 10 ns, and 7 ns. Assume that
pipelining adds 1 ns of overhead. How much speedup in the instruction execution
rate will be gained from a pipeline?
(Assume pipeline is full)
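Here the unpipelined machine runs the five units back to back, while the pipelined clock must fit the slowest unit plus the 1-ns overhead:

```python
# Worked check: unbalanced stages limit the pipelined clock.
stage_ns = [10, 8, 10, 10, 7]
unpipelined_ns = sum(stage_ns)    # 45 ns: stages execute sequentially
pipelined_ns = max(stage_ns) + 1  # 11 ns: slowest stage + 1 ns overhead
print(round(unpipelined_ns / pipelined_ns, 2))  # 4.09
```

Note the speedup is about 4.1, not 5, because the stages are unbalanced and the overhead stretches the clock.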
Performance using pipelining
Assume an unpipelined machine takes 100 ns to execute an instruction. Assume
that each of the five functional units, which operate in each of the five cycles,
takes 20 ns. Determine the speedup ratio of the pipeline for executing
1000 instructions.
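For a finite instruction stream the pipeline fill time matters: the first instruction takes all 5 stages, and each later instruction completes one cycle after the previous one.

```python
# Worked check: k-stage pipeline, n instructions -> k + (n - 1) cycles.
n = 1000
unpipelined_ns = 100 * n            # 100 ns per instruction, sequential
pipelined_ns = (5 + (n - 1)) * 20   # fill 5 stages, then one instruction per 20-ns cycle
print(round(unpipelined_ns / pipelined_ns, 2))  # 4.98
```

The speedup approaches the ideal factor of 5 as n grows; the fill cycles keep it slightly below.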
Pipeline hazards
• 3 types of problems (hazards) limit pipeline effectiveness
1) Structural hazards: caused by conflicts over datapath resources, i.e., two
instructions trying to access the same resource in the same clock cycle
2) Data hazards: caused when an instruction depends on the result of a
previous instruction that is not yet available
3) Control hazards: caused when pipelining branches or other instructions
that change the PC
BEQZ R4, Addr
If Regs[R4] == 0:
PC ← PC + 4 + signext(Addr)
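The BEQZ semantics above can be sketched as a tiny next-PC function (the register dictionary and the 40-byte offset are illustrative; a real implementation would sign-extend the encoded immediate):

```python
# Hedged sketch of BEQZ: branch taken only when the register holds zero.
def beqz(pc, regs, r, offset):
    """Return the next PC: PC + 4 + offset if Regs[r] == 0, else PC + 4."""
    return pc + 4 + offset if regs[r] == 0 else pc + 4

regs = {4: 0}
print(beqz(100, regs, 4, 40))  # 144: branch taken
regs[4] = 7
print(beqz(100, regs, 4, 40))  # 104: fall through to the next instruction
```

Because the outcome depends on a register value, the pipeline cannot know the next PC at fetch time; that is exactly the control hazard.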
Performance of pipeline with stalls
• Speedup using pipeline (ideal case, no stalls) = number of pipeline
stages, i.e., pipeline depth
• Speedup using pipeline =
(average instruction time unpipelined) / (average instruction time pipelined)
= (CPI unpipelined × CC time unpipelined) / (CPI pipelined × CC time pipelined)
• The pipeline reduces the average execution time per instruction
Pipeline
• Pipelining decreases CPI (assume CC time unpipelined = CC time pipelined, i.e., ignore the CC time overhead due
to pipelining)
• Speedup using pipeline = CPI unpipelined / CPI pipelined
Performance of pipeline with stalls
• Speedup using pipeline = CPI unpipelined / CPI pipelined with stalls
= pipeline depth / (1 + pipeline stall CCs per instruction)
• CPI pipelined, ideal = 1 (i.e., no stalls)
• CPI pipelined with stalls = CPI pipelined ideal + pipeline stall CCs per instruction
Here the stall is 1 CC
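The stall formula above is easy to evaluate; this sketch assumes the ideal CPI of 1 and equal clock periods before and after pipelining, as the slide does:

```python
# Speedup = pipeline depth / (1 + stall CCs per instruction),
# assuming ideal CPI = 1 and the same clock period in both machines.
def speedup_with_stalls(depth, stalls_per_instr):
    return depth / (1 + stalls_per_instr)

print(speedup_with_stalls(5, 0))  # 5.0: no stalls, speedup equals pipeline depth
print(speedup_with_stalls(5, 1))  # 2.5: one stall CC per instruction halves it
```

Even a single stall cycle per instruction costs half of the ideal speedup, which is why hazard avoidance matters so much.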
Load instruction
• To resolve the conflict, stall the pipeline for one CC when a data-memory reference instruction occurs
Find stalls due to structural hazards
Load instruction
Instruction i IF ID EX MEM WB
Instruction j    IF ID EX MEM WB
Data Hazards classification
Write After Write (WAW):
• Not possible in DLX unless the following changes are made: 1) the WB of an ALU instruction is moved into the MEM stage, or 2)
data memory access (MEM) takes two pipe stages
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9 CC10 CC11 CC12 CC13 CC14
LW Rb, b IF ID EX MEM WB
LW Rc, c IF ID EX MEM WB
ADD Ra, Rb, Rc IF ID Stall EX MEM WB
SW Ra, a IF Stall ID EX MEM WB
LW Re, e Stall IF ID EX MEM WB
LW Rf, f IF ID EX MEM WB
SUB Rd, Re, Rf IF ID Stall EX MEM WB
SW Rd, d IF Stall ID EX MEM WB
Pipeline scheduling
• After pipeline scheduling:
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9 CC10 CC11 CC12
LW Rb, b IF ID EX MEM WB
LW Rc, c IF ID EX MEM WB
LW Re, e IF ID EX MEM WB
ADD Ra, Rb, Rc IF ID EX MEM WB
LW Rf, f IF ID EX MEM WB
SW Ra, a IF ID EX MEM WB
SUB Rd, Re, Rf IF ID EX MEM WB
SW Rd, d IF ID EX MEM WB
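Comparing the two diagrams above is a cycle-counting exercise: n instructions through a k-stage pipeline take k + (n - 1) cycles plus any stalls.

```python
# Cycle count for n instructions in a k-stage pipeline with stalls.
def total_cycles(n, depth, stalls):
    return depth + (n - 1) + stalls

print(total_cycles(8, 5, 2))  # 14: original order, two load-use stalls (ADD and SUB)
print(total_cycles(8, 5, 0))  # 12: after scheduling, no stalls remain
```

Scheduling moves the loads of Re and Rf earlier so their results are ready when ADD and SUB need them, removing both stall cycles.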
Find stalls due to data hazards (Assume no forwarding/ bypassing)
[Exercise diagram: five instructions, each passing through IF ID EX MEM WB, laid out over CC1–CC15; identify where stalls occur]
Remove data hazards by forwarding
• Without forwarding: stall of 4 CCs
• With forwarding (results bypassed directly to the ALU input):
CC1 CC2 CC3 CC4 CC5 CC6 CC7
SW 12(R1), R4 IF ID EX MEM WB
No stalls!
Performance with data hazards
Assume 30% of the instructions are loads. Half the time, the instruction following a load
depends on the result of the load.
If the hazard causes a single-cycle delay, how much faster is the ideal pipeline?
1) A load followed by an instruction that depends on its result: CPI = 1 + 1 = 2
2) A load followed by an instruction that does not depend on its result: CPI = 1
Average CPI for loads = 1.5 (each case occurs 50% of the time)
30% of instructions are loads: CPI = 1.5; 70% are not loads: CPI = 1
Overall CPI = (0.3 × 1.5) + (0.7 × 1) = 1.15
Ideal CPI = 1, so the ideal pipeline is 1.15 times faster
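The arithmetic in this example is a weighted average of per-class CPIs:

```python
# Worked check of the load-hazard CPI example above.
load_frac = 0.3
cpi_load = 0.5 * 2 + 0.5 * 1  # half of loads incur a 1-cycle stall -> 1.5
cpi = load_frac * cpi_load + (1 - load_frac) * 1.0
print(round(cpi, 2))  # 1.15: the ideal pipeline (CPI = 1) is 1.15x faster
```

The same weighted-average pattern applies to any mix of hazard frequencies and stall penalties.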