COA 6 Pipeline Intro-1

Download as pdf or txt
Download as pdf or txt
You are on page 1of 30

Pipelined Microarchitecture:

Fundamentals of Design

Instructor: Hyokeun Lee Ph.D.


Reading Materials
 Chapter 4.6 – 4.10

2
Contents
 Evolution from Multi-Cycle to Pipeline
 Analysis of Simple Pipeline
 Datapath of Pipelined Microarchitecture
 Controls of Pipelined Microarchitecture

3
Limitation of Multi-Cycle Microarchitecture
 Benefits of multi-cycle design
• Divided an instruction lifecycle into a few lower latency steps
• Facilitate higher clock frequency than in single-cycle design

 Problem of multi-cycle design


• Process only one instruction during several cycles at a time
• Not all hardware resources are well utilized

cycles

Inst1 IF ID EXE MEM WB


15 cycles
for 3 insts!!!
Inst2 IF ID EXE MEM WB

IF ID EXE …
Inst3

4
Intuition: Pipeline Process in Factory
 A car needs several manufacturing process: engine, door, wheels…
 But factories do not do this separately at a time!

Executing instructions in a pipelined manner can enhance performance!

5
Similarly, Pipelining Instructions!
 Hardware designers want to let everything work every time
• (Mostly) higher throughput  shorter execution time  lower energy

 Let’s overlap states of different instructions!


• Can utilize each state’s resource at the same time
• E.g. Inst1’s WB, Inst2’s MEM, Inst3’s EXE, … are executed concurrently

0 1 2 3 4 5 6 Saved 8 cycles!
cycles

Inst1 IF ID EXE MEM WB

Inst2 IF ID EXE MEM WB

IF ID … WB
EXE MEM
Inst3

6
Term Clarification: Concurrency vs. Parallelism
 Parallelism
• Process multiple tasks exactly at the same time
• Usually need duplicates of a resource to support any kinds of execution

 Concurrency
• Process multiple tasks in during the same time  seemless execution
• Usually enhance the existing resources while executing multiple tasks
A part of my resources is going to
ALU 1 inst1 be free after inst1, let’s do inst2 then!

ALU 2 inst2 inst1


inst2
ALU
inst3
ALU 3 inst3
inst4

ALU 4 inst4

Hardware pipeline is a form of achieving concurrency!

7
Contents
 Evolution from Multi-Cycle to Pipeline
 Analysis of Simple Pipeline
• Simple Pipeline in Digital Systems
• Analysis of Simple Pipeline

 Datapath of Pipelined Microarchitecture


 Controls of Pipelined Microarchitecture

8
Pipeline as Implementation
 Pipeline  concurrent execution of different phases of different instructions
 Need to preserve states for different instructions  what have we used til now?

Single-cycle
μarch

Control Multi-cycle
unit state μarch

IF ID EXE MEM

 Flip-flop or latch?
• Latch can be used for extremely optimizing performance  learn in “VLSI system”
• But let’s assume flip-flop-based registers in this class!

9
Power of Pipeline
 Can achieve high frequency and high utilization at the same time!

Combinational logic
(100 ns)

 More stages, higher clock frequency

Comb. logic Comb. logic


(50ns) (50ns)

 Can leverage microprogramming concept to divide more stages

Comb. Comb. Comb. Comb.


(25ns) (25ns) (25ns) (25ns)

10
Overhead of Pipeline: Performance Model
 Assume original logic latency is T, register latency is R
 Also assume logic is sliced uniformly in terms of latency and logics
 Non-pipelined logic

Combinational 1
Logic (T)
𝑃𝑃𝑃𝑃𝑃𝑃𝑓𝑓𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 =
𝑇𝑇 + 𝑅𝑅

 n-stage pipelined logic  assume all stages filled with tasks

Comb. Comb. … Comb.


(T/n) (T/n) (T/n)

1 𝑃𝑃𝑃𝑃𝑃𝑃𝑓𝑓𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 𝑇𝑇 + 𝑅𝑅
𝑃𝑃𝑃𝑃𝑃𝑃𝑓𝑓𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 = = T/R
𝑇𝑇 𝑃𝑃𝑃𝑃𝑃𝑃𝑓𝑓𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 𝑇𝑇 + 𝑅𝑅
+ 𝑅𝑅
𝑛𝑛 𝑛𝑛

11
Overhead of Pipeline: Cost Model
 Pipeline does not come with free lunch  Insert registers between stages
 Assume original logic requires M gates, register needs K
 Non-pipelined logic

Combinational
Logic (M)
# 𝐺𝐺𝐺𝐺𝐺𝐺𝐺𝐺𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 = 𝑀𝑀 + 𝐾𝐾

 n-stage pipelined logic

Comb. Comb. … Comb.


(M/n) (M/n) (M/n)

𝑀𝑀
# 𝐺𝐺𝐺𝐺𝐺𝐺𝐺𝐺𝑠𝑠𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 = ∗ 𝑛𝑛 + 𝐾𝐾 ∗ 𝑛𝑛 + 1 ≈ 𝑀𝑀 + 𝐾𝐾𝐾𝐾 ≈ 𝑀𝑀 (𝑖𝑖𝑖𝑖 𝑀𝑀 ≫ 𝐾𝐾)
𝑛𝑛

It seems there are benefits in terms of performance with minor cost…

12
Is Pipelined Microarchitecture Easy to Implement?
 How would you properly slice the microarchitecture below?
 Is it just about slicing the design?  hazards, exception, branch…

Design complexity is a major challenge in pipelined microarchitecture!

13
Contents
 Evolution from Multi-Cycle to Pipeline
 Analysis of Simple Pipeline
 Datapath of Pipelined Microarchitecture
• Stage Slicing
• Modifications for LW/SW and ALU Instructions

 Controls of Pipelined Microarchitecture

14
Stage Slicing (1)
IF-stage ID-stage EXE-stage MEM-stage

Add
4 Add
<<2
rs r1
*r1
PC addr rt r2
Inst. ALU addr
r3 *r2 rdata
rd wdata[r3] wdata
Inst
memory Reg file Data
memory
imm (16b) Signed 32b
extend

WB
 Is it possible to slice further for more stages to boost the clock frequency?
 How to handle signals feeding back previous stages?  talk later
15
Stage Slicing (2)
 Assume: Memory ops take 200 ns, register ops take 100 ns, ALU takes 400 ns
 Case 1: Slice ALU into two uniform stages
100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 Time
(ns)

Inst1 IF ID EXE1 EXE2 MEM WB

Inst2 IF ID EXE1 EXE2 MEM WB

Inst3 IF ID EXE1 EXE2 MEM

 Case 2: Slice ALU into four stages  same throughput & higher latency
100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 Time
(ns)
EXE EXE EXE EXE
Inst1 IF ID
1 2 3 4
MEM
EXE EXE EXE EXE
Inst2 IF ID
1 2 3 4
EXE EXE EXE E
Inst3 IF ID
1 2 3

16
Stage-Registers: Registers between Sliced Stages
 Placed between stages  Pass “the whole state” to the next stage
 Still need some modifications for actual use!

Add
Add
4 <<2
rs
r1
P *r1
addr rt r2
C ALU addr
Inst. r3 RD
*r2
rd wdata[r3]
Inst WR
mem Reg file Data
imm mem
(16b) Signed 32b
extend

IF-stage ID-stage MEM- WB-


EX-stage
stage stage

17
Placing Registers into Sliced Stages: LW
 How to properly write back data to rt? addr = GPR[rs] + signExtend(imm)
GPR[rt]  Mem[addr]
PC  PC + 4

Add
Add
4 <<2

Reg[rsA]

Mem[addr]
rs
r1
PCA+4

*r1

addrA
P rt
addr r2
C ALU addr
IRA

Inst. r3 RD
*r2
rd wdata[r3]
Inst WR
mem Reg file Data
imm immA mem
(16b) Signed 32b
extend

InstA : LW

18
Placing Registers into Sliced Stages: LW
 How to properly write back data to rt? addr = GPR[rs] + signExtend(imm)
 Need to propagate rt-index throughout exec. GPR[rt]  Mem[addr]
PC  PC + 4

Add
Add
4 <<2

Reg[rsA]

Mem[addr]
rs
r1
PCA+4

*r1

addrA
P rt
addr r2
C ALU addr
IRA

Inst. r3 RD
*r2
Inst wdata[r3] WR
mem Reg file Data
imm immA mem
(16b) Signed 32b
extend
rtA

rtA

rtA
InstA : LW

19
Placing Registers into Sliced Stages: R-type ALU
 How to properly write data back to rd? dest_reg  rd
ALUOut  GPR[rs] + GPR[rt]
GPR[dest_reg]ALUOut
PC  PC + 4

Add
Add
4 <<2

Reg[rtB] Reg[rsB]
Not used
rs
r1
PCB+4

ALUOutB
P *r1
addr rt r2
C ALU addr
IRB

Inst. rd r3 RD
x *r2

ALUOutB
Inst wdata[r3] WR
mem Reg file Data
imm mem
(16b) Signed 32b
extend

InstB : ADD

20
Placing Registers into Sliced Stages: R-type ALU
 How to properly write data back to rd? dest_reg  rd
ALUOut  GPR[rs] + GPR[rt]
 Propagate rd-index throughout process GPR[dest_reg]ALUOut
PC  PC + 4

Add
Add
4 <<2

Reg[rtB] Reg[rsB]
Not used
rs
r1
PCB+4

ALUOutB
P *r1
addr rt r2
C ALU addr
IRB

Inst. rd r3 RD
*r2

ALUOutB
Inst wdata[r3] WR
mem Reg file Data
imm mem
(16b) Signed 32b
extend
rtB

rdB

rdB
InstB : ADD
rdB

21
Placing Registers into Sliced Stages: I-type ALU
 Can directly adapt previous architecture dest_reg  rt
ALUOut  GPR[rs] + sign-extend(imm)
 Propagate immediate field as operand GPR[dest_reg]ALUOut
PC  PC + 4

Add
Add
4 <<2

Reg[rsC]
Not used
rs
r1
PCC+4

ALUOutC
P *r1
addr rt r2
C ALU addr
IRC

Inst. rd r3 RD
*r2

ALUOutC
Inst wdata[r3] WR

rtC imm32C
mem Reg file Data
imm mem
(16b) Signed 32b
extend

rtC

rtC
InstC : ADDI
rdC

22
Putting All Together
 Is it the only design?
 Microarchitecture can be implementation-specific  if spec of ISA is followed

Add
Add
4 <<2
rs
r1
P *r1
addr rt r2
C ALU addr
Inst. rd r3 RD
*r2
Inst wdata[r3] WR
mem Reg file Data
imm mem
(16b) Signed 32b
extend

23
Putting All Together: Another Option
 Place the multiplexer right in front of register file
 Vs. previous design  more registers

Add
Add
4 <<2
rs
r1
P *r1
addr rt r2
C ALU addr
Inst. rd r3 RD
*r2
Inst wdata[r3] WR
mem Reg file Data
imm mem
(16b) Signed 32b
extend

24
Contents
 Evolution from Multi-Cycle to Pipeline
 Analysis of Simple Pipeline
 Datapath of Pipelined Microarchitecture
 Controls of Pipelined Microarchitecture
• Sequencing Control Signals throughout Pipeline
• Further Issues to be Resolved

25
Don’t Forget Control Signals!
 Control signals must be asserted / de-asserted at the correct timing
 Still does not work  how to feed control signals in pipelined design?
MemReg

PCSrc Add
RegWrite Add
4 <<2 MemRead
rs
r1
P *r1
addr rt r2
C ALU addr
Inst. rd r3 RD
*r2
Inst wdata[r3] WR
mem Reg file Data
imm mem
(16b) Signed 32b ALUOp
extend ALUSrc MemWrite

RegDst

26
Sequencing Control Signals (1)
 Control signals are needed at different cycle at corresponding stage
 Option 1: Generate all control signals at ID-stage, propagate them
• Need duplicates of registers to propagate control signals for different stages
• Easy to slice if signal generation takes some cycles  good scalability for complex ISA

Control signal registers


WB for different stages


Control MEM WB



EX MEM WB


Instruction

ID EXE MEM

27
Sequencing Control Signals (2)
 Option 2: Propagate instruction itself and generate control signal at each stage
 Easy to apply if decoding at each stage does not bottleneck the stage itself
 Control logic complexity can be amortized across stages

Instruction

Control Control Control Control

RegWrite RegDst MemWr MemRead


ALUOp MemRd
PCSrc
ALUSrc

IF ID EXE MEM WB

It depends on designer according to the ISA and constraints!

28
Putting Control Sequencing into Pipeline

RegWr
PCSrc
WB


Control MEM WB


EX … MEM … WB

MemReg
Add Add
4 Branch
Br
<<2 True
rs
r1
P *r1
addr rt r2
C ALU addr
Inst. rd r3 RD
*r2
Inst wdata[r3] WR
mem Reg file Data
imm mem
(16b) Signed 32b ALUOp
extend ALUSrc MemWr
Done? MemRd
Sure?

RegDst

29
Further Design Issues
 This pipeline microarchitecture is yet to be deployed!
 Dependency may exist between instructions Data hazard
• In real program, many instructions are dependent of each other
• E.g., ADD consumer of LW  but MEM & EX overlapped!

 Branch needs special handling Control hazard


• If branch is taken, instructions at ID and EXE stages must be flushed (or squashed)
• As if nothing happened  performance penalty

 Instruction memory data memory may be unified Structural hazard


• IF stage & MEM stage cannot be overlapped due to structural limit

Hazards complicate processor pipeline design in practice!

30

You might also like