COA 6 Pipeline Intro-1
COA 6 Pipeline Intro-1
COA 6 Pipeline Intro-1
Fundamentals of Design
2
Contents
Evolution from Multi-Cycle to Pipeline
Analysis of Simple Pipeline
Datapath of Pipelined Microarchitecture
Controls of Pipelined Microarchitecture
3
Limitation of Multi-Cycle Microarchitecture
Benefits of multi-cycle design
• Divided an instruction lifecycle into a few lower latency steps
• Facilitate higher clock frequency than in single-cycle design
cycles
IF ID EXE …
Inst3
4
Intuition: Pipeline Process in Factory
A car needs several manufacturing process: engine, door, wheels…
But factories do not do this separately at a time!
5
Similarly, Pipelining Instructions!
Hardware designers want to let everything work every time
• (Mostly) higher throughput shorter execution time lower energy
0 1 2 3 4 5 6 Saved 8 cycles!
cycles
IF ID … WB
EXE MEM
Inst3
6
Term Clarification: Concurrency vs. Parallelism
Parallelism
• Process multiple tasks exactly at the same time
• Usually need duplicates of a resource to support any kinds of execution
Concurrency
• Process multiple tasks in during the same time seemless execution
• Usually enhance the existing resources while executing multiple tasks
A part of my resources is going to
ALU 1 inst1 be free after inst1, let’s do inst2 then!
ALU 4 inst4
7
Contents
Evolution from Multi-Cycle to Pipeline
Analysis of Simple Pipeline
• Simple Pipeline in Digital Systems
• Analysis of Simple Pipeline
8
Pipeline as Implementation
Pipeline concurrent execution of different phases of different instructions
Need to preserve states for different instructions what have we used til now?
Single-cycle
μarch
Control Multi-cycle
unit state μarch
IF ID EXE MEM
Flip-flop or latch?
• Latch can be used for extremely optimizing performance learn in “VLSI system”
• But let’s assume flip-flop-based registers in this class!
9
Power of Pipeline
Can achieve high frequency and high utilization at the same time!
Combinational logic
(100 ns)
10
Overhead of Pipeline: Performance Model
Assume original logic latency is T, register latency is R
Also assume logic is sliced uniformly in terms of latency and logics
Non-pipelined logic
Combinational 1
Logic (T)
𝑃𝑃𝑃𝑃𝑃𝑃𝑓𝑓𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 =
𝑇𝑇 + 𝑅𝑅
1 𝑃𝑃𝑃𝑃𝑃𝑃𝑓𝑓𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 𝑇𝑇 + 𝑅𝑅
𝑃𝑃𝑃𝑃𝑃𝑃𝑓𝑓𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 = = T/R
𝑇𝑇 𝑃𝑃𝑃𝑃𝑃𝑃𝑓𝑓𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 𝑇𝑇 + 𝑅𝑅
+ 𝑅𝑅
𝑛𝑛 𝑛𝑛
11
Overhead of Pipeline: Cost Model
Pipeline does not come with free lunch Insert registers between stages
Assume original logic requires M gates, register needs K
Non-pipelined logic
Combinational
Logic (M)
# 𝐺𝐺𝐺𝐺𝐺𝐺𝐺𝐺𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 = 𝑀𝑀 + 𝐾𝐾
𝑀𝑀
# 𝐺𝐺𝐺𝐺𝐺𝐺𝐺𝐺𝑠𝑠𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 = ∗ 𝑛𝑛 + 𝐾𝐾 ∗ 𝑛𝑛 + 1 ≈ 𝑀𝑀 + 𝐾𝐾𝐾𝐾 ≈ 𝑀𝑀 (𝑖𝑖𝑖𝑖 𝑀𝑀 ≫ 𝐾𝐾)
𝑛𝑛
12
Is Pipelined Microarchitecture Easy to Implement?
How would you properly slice the microarchitecture below?
Is it just about slicing the design? hazards, exception, branch…
13
Contents
Evolution from Multi-Cycle to Pipeline
Analysis of Simple Pipeline
Datapath of Pipelined Microarchitecture
• Stage Slicing
• Modifications for LW/SW and ALU Instructions
14
Stage Slicing (1)
IF-stage ID-stage EXE-stage MEM-stage
Add
4 Add
<<2
rs r1
*r1
PC addr rt r2
Inst. ALU addr
r3 *r2 rdata
rd wdata[r3] wdata
Inst
memory Reg file Data
memory
imm (16b) Signed 32b
extend
WB
Is it possible to slice further for more stages to boost the clock frequency?
How to handle signals feeding back previous stages? talk later
15
Stage Slicing (2)
Assume: Memory ops take 200 ns, register ops take 100 ns, ALU takes 400 ns
Case 1: Slice ALU into two uniform stages
100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 Time
(ns)
Case 2: Slice ALU into four stages same throughput & higher latency
100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 Time
(ns)
EXE EXE EXE EXE
Inst1 IF ID
1 2 3 4
MEM
EXE EXE EXE EXE
Inst2 IF ID
1 2 3 4
EXE EXE EXE E
Inst3 IF ID
1 2 3
16
Stage-Registers: Registers between Sliced Stages
Placed between stages Pass “the whole state” to the next stage
Still need some modifications for actual use!
Add
Add
4 <<2
rs
r1
P *r1
addr rt r2
C ALU addr
Inst. r3 RD
*r2
rd wdata[r3]
Inst WR
mem Reg file Data
imm mem
(16b) Signed 32b
extend
17
Placing Registers into Sliced Stages: LW
How to properly write back data to rt? addr = GPR[rs] + signExtend(imm)
GPR[rt] Mem[addr]
PC PC + 4
Add
Add
4 <<2
Reg[rsA]
Mem[addr]
rs
r1
PCA+4
*r1
addrA
P rt
addr r2
C ALU addr
IRA
Inst. r3 RD
*r2
rd wdata[r3]
Inst WR
mem Reg file Data
imm immA mem
(16b) Signed 32b
extend
InstA : LW
18
Placing Registers into Sliced Stages: LW
How to properly write back data to rt? addr = GPR[rs] + signExtend(imm)
Need to propagate rt-index throughout exec. GPR[rt] Mem[addr]
PC PC + 4
Add
Add
4 <<2
Reg[rsA]
Mem[addr]
rs
r1
PCA+4
*r1
addrA
P rt
addr r2
C ALU addr
IRA
Inst. r3 RD
*r2
Inst wdata[r3] WR
mem Reg file Data
imm immA mem
(16b) Signed 32b
extend
rtA
rtA
rtA
InstA : LW
19
Placing Registers into Sliced Stages: R-type ALU
How to properly write data back to rd? dest_reg rd
ALUOut GPR[rs] + GPR[rt]
GPR[dest_reg]ALUOut
PC PC + 4
Add
Add
4 <<2
Reg[rtB] Reg[rsB]
Not used
rs
r1
PCB+4
ALUOutB
P *r1
addr rt r2
C ALU addr
IRB
Inst. rd r3 RD
x *r2
ALUOutB
Inst wdata[r3] WR
mem Reg file Data
imm mem
(16b) Signed 32b
extend
InstB : ADD
20
Placing Registers into Sliced Stages: R-type ALU
How to properly write data back to rd? dest_reg rd
ALUOut GPR[rs] + GPR[rt]
Propagate rd-index throughout process GPR[dest_reg]ALUOut
PC PC + 4
Add
Add
4 <<2
Reg[rtB] Reg[rsB]
Not used
rs
r1
PCB+4
ALUOutB
P *r1
addr rt r2
C ALU addr
IRB
Inst. rd r3 RD
*r2
ALUOutB
Inst wdata[r3] WR
mem Reg file Data
imm mem
(16b) Signed 32b
extend
rtB
rdB
rdB
InstB : ADD
rdB
21
Placing Registers into Sliced Stages: I-type ALU
Can directly adapt previous architecture dest_reg rt
ALUOut GPR[rs] + sign-extend(imm)
Propagate immediate field as operand GPR[dest_reg]ALUOut
PC PC + 4
Add
Add
4 <<2
Reg[rsC]
Not used
rs
r1
PCC+4
ALUOutC
P *r1
addr rt r2
C ALU addr
IRC
Inst. rd r3 RD
*r2
ALUOutC
Inst wdata[r3] WR
rtC imm32C
mem Reg file Data
imm mem
(16b) Signed 32b
extend
rtC
rtC
InstC : ADDI
rdC
22
Putting All Together
Is it the only design?
Microarchitecture can be implementation-specific if spec of ISA is followed
Add
Add
4 <<2
rs
r1
P *r1
addr rt r2
C ALU addr
Inst. rd r3 RD
*r2
Inst wdata[r3] WR
mem Reg file Data
imm mem
(16b) Signed 32b
extend
23
Putting All Together: Another Option
Place the multiplexer right in front of register file
Vs. previous design more registers
Add
Add
4 <<2
rs
r1
P *r1
addr rt r2
C ALU addr
Inst. rd r3 RD
*r2
Inst wdata[r3] WR
mem Reg file Data
imm mem
(16b) Signed 32b
extend
24
Contents
Evolution from Multi-Cycle to Pipeline
Analysis of Simple Pipeline
Datapath of Pipelined Microarchitecture
Controls of Pipelined Microarchitecture
• Sequencing Control Signals throughout Pipeline
• Further Issues to be Resolved
25
Don’t Forget Control Signals!
Control signals must be asserted / de-asserted at the correct timing
Still does not work how to feed control signals in pipelined design?
MemReg
PCSrc Add
RegWrite Add
4 <<2 MemRead
rs
r1
P *r1
addr rt r2
C ALU addr
Inst. rd r3 RD
*r2
Inst wdata[r3] WR
mem Reg file Data
imm mem
(16b) Signed 32b ALUOp
extend ALUSrc MemWrite
RegDst
26
Sequencing Control Signals (1)
Control signals are needed at different cycle at corresponding stage
Option 1: Generate all control signals at ID-stage, propagate them
• Need duplicates of registers to propagate control signals for different stages
• Easy to slice if signal generation takes some cycles good scalability for complex ISA
…
Control MEM WB
…
…
…
EX MEM WB
…
…
Instruction
ID EXE MEM
27
Sequencing Control Signals (2)
Option 2: Propagate instruction itself and generate control signal at each stage
Easy to apply if decoding at each stage does not bottleneck the stage itself
Control logic complexity can be amortized across stages
Instruction
IF ID EXE MEM WB
28
Putting Control Sequencing into Pipeline
RegWr
PCSrc
WB
…
Control MEM WB
…
EX … MEM … WB
MemReg
Add Add
4 Branch
Br
<<2 True
rs
r1
P *r1
addr rt r2
C ALU addr
Inst. rd r3 RD
*r2
Inst wdata[r3] WR
mem Reg file Data
imm mem
(16b) Signed 32b ALUOp
extend ALUSrc MemWr
Done? MemRd
Sure?
RegDst
29
Further Design Issues
This pipeline microarchitecture is yet to be deployed!
Dependency may exist between instructions Data hazard
• In real program, many instructions are dependent of each other
• E.g., ADD consumer of LW but MEM & EX overlapped!
30