COA 6 Pipeline Intro-1

Pipelined Microarchitecture:
Fundamentals of Design
Instructor: Hyokeun Lee Ph.D.

Reading Materials
▪ Chapter 4.6 – 4.10

Contents
▪ Evolution from Multi-Cycle to Pipeline
▪ Analysis of Simple Pipeline
▪ Datapath of Pipelined Microarchitecture
▪ Controls of Pipelined Microarchitecture

Limitation of Multi-Cycle Microarchitecture
▪ Benefits of multi-cycle design
• Divides an instruction’s lifecycle into a few lower-latency steps
• Enables a higher clock frequency than the single-cycle design
▪ Problem of multi-cycle design
• Processes only one instruction at a time, over several cycles
• Not all hardware resources are well utilized
[Figure: multi-cycle timeline in cycles. Inst1 goes through IF, ID, EXE, MEM, WB before Inst2 starts, and Inst3 follows Inst2: 15 cycles for 3 insts!!!]

Intuition: Pipeline Process in Factory
▪ A car needs several manufacturing processes: engine, doors, wheels…
▪ But factories do not finish one car completely before starting the next!
Executing instructions in a pipelined manner can enhance performance!

Similarly, Pipelining Instructions!
▪ Hardware designers want every resource to be doing useful work in every cycle
• (Mostly) higher throughput → shorter execution time → lower energy
▪ Let’s overlap states of different instructions!
• Can utilize each state’s resources at the same time
• E.g., Inst1’s WB, Inst2’s MEM, Inst3’s EXE, … are executed concurrently
[Figure: pipelined timeline over cycles 0 to 6. Inst1, Inst2, and Inst3 each go through IF, ID, EXE, MEM, WB, each starting one cycle after the previous instruction. Saved 8 cycles!]

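The cycle counts on the two timelines above can be checked with a short calculation. This is a minimal sketch, assuming an ideal 5-stage pipeline with no stalls; the function names are illustrative and do not come from the slides.

```python
def multi_cycle_cycles(num_insts: int, num_stages: int = 5) -> int:
    """Multi-cycle design: each instruction occupies the datapath for
    num_stages cycles before the next one may start."""
    return num_insts * num_stages


def pipelined_cycles(num_insts: int, num_stages: int = 5) -> int:
    """Ideal pipeline (assumption: no stalls): the first instruction needs
    num_stages cycles, every later one finishes one cycle after its predecessor."""
    if num_insts == 0:
        return 0
    return num_stages + (num_insts - 1)


if __name__ == "__main__":
    n = 3
    saved = multi_cycle_cycles(n) - pipelined_cycles(n)
    print(multi_cycle_cycles(n), pipelined_cycles(n), saved)  # 15 7 8
```

For long instruction streams the pipelined count approaches one cycle per instruction, which is where the throughput gain comes from.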
Term Clarification: Concurrency vs. Parallelism
▪ Parallelism
• Processes multiple tasks at exactly the same time
• Usually needs duplicated resources to support any kind of execution
▪ Concurrency
• Processes multiple tasks during the same period of time → seamless execution
• Usually makes better use of the existing resources while executing multiple tasks
[Figure: Parallelism: ALU 1 to ALU 4 execute inst1 to inst4 at the same time, one per ALU. Concurrency: a single ALU takes inst1 to inst4 in overlapped fashion: “A part of my resources is going to be free after inst1, let’s do inst2 then!”]
Hardware pipeline is a form of achieving concurrency!

Contents
• Simple Pipeline in Digital Systems
• Analysis of Simple Pipeline

Pipeline as Implementation
▪ Pipeline → concurrent execution of different phases of different instructions
▪ Need to preserve state for different instructions → what have we used until now?
[Figure: single-cycle μarch; multi-cycle μarch with control unit state; pipeline stages IF, ID, EXE, MEM]
▪ Flip-flop or latch?
• Latches can be used to push performance to the extreme → covered in “VLSI system”
• But let’s assume flip-flop-based registers in this class!

Power of Pipeline
▪ Can achieve high frequency and high utilization at the same time!
[Figure: one block of combinational logic (100 ns)]
▪ More stages, higher clock frequency
[Figure: the logic split into two stages (50 ns each)]
▪ Can leverage the microprogramming concept to divide the logic into more stages
[Figure: the logic split into four stages (25 ns each)]

Overhead of Pipeline: Performance Model
▪ Assume the original logic latency is T and the register latency is R
▪ Also assume the logic is sliced uniformly in terms of latency and amount of logic
▪ Non-pipelined logic
[Figure: one block of combinational logic (T)]
$$\mathrm{Perf}_{single} = \frac{1}{T + R}$$
▪ n-stage pipelined logic → assume all stages are filled with tasks
[Figure: n blocks of combinational logic (T/n each)]
$$\mathrm{Perf}_{pipe} = \frac{1}{\frac{T}{n} + R}, \qquad \frac{\mathrm{Perf}_{pipe}}{\mathrm{Perf}_{single}} = \frac{T + R}{\frac{T}{n} + R}$$
As n grows, the ratio approaches (T + R)/R = 1 + T/R, so the achievable gain is governed by T/R.

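The performance model above can be evaluated numerically. The sketch below is illustrative only: the function names and the example values (T = 100 ns, R = 5 ns) are assumptions, not figures from the slides.

```python
def perf_single(T: float, R: float) -> float:
    """Throughput of the non-pipelined logic: one result every T + R."""
    return 1.0 / (T + R)


def perf_pipe(T: float, R: float, n: int) -> float:
    """Throughput of an n-stage pipeline with uniform stages, all stages full."""
    return 1.0 / (T / n + R)


def speedup(T: float, R: float, n: int) -> float:
    """Perf_pipe / Perf_single = (T + R) / (T/n + R)."""
    return perf_pipe(T, R, n) / perf_single(T, R)


if __name__ == "__main__":
    T, R = 100.0, 5.0  # assumed example latencies in ns
    for n in (1, 2, 4, 8, 16):
        print(f"n={n:2d}  speedup={speedup(T, R, n):.2f}")
    # The register latency R caps the gain at 1 + T/R, however large n gets.
    print(f"upper bound = {1 + T / R:.2f}")
```

The diminishing returns come from R: once T/n is comparable to R, adding stages mostly adds register overhead.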
Overhead of Pipeline: Cost Model
▪ Pipelining is no free lunch → registers must be inserted between stages
▪ Assume the original logic requires M gates and each register needs K gates
▪ Non-pipelined logic
[Figure: one block of combinational logic (M)]
$$\#\mathrm{Gates}_{single} = M + K$$
▪ n-stage pipelined logic
[Figure: n blocks of combinational logic (M/n each)]
$$\#\mathrm{Gates}_{pipe} = \frac{M}{n} \cdot n + K \cdot (n + 1) \approx M + Kn \approx M \quad (\text{if } M \gg K)$$
It seems there are benefits in terms of performance with minor cost…

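The cost model above can be checked the same way. This is a rough sketch with assumed example values (M = 100,000 gates, K = 500 gates per register); the function names are hypothetical.

```python
def gates_single(M: int, K: int) -> int:
    """Non-pipelined logic: M gates of logic plus one register (as on the slide)."""
    return M + K


def gates_pipe(M: int, K: int, n: int) -> int:
    """n-stage pipeline: (M/n) * n = M gates of logic plus n + 1 registers."""
    return M + K * (n + 1)


if __name__ == "__main__":
    M, K = 100_000, 500  # assumed example sizes
    for n in (2, 5, 10):
        total = gates_pipe(M, K, n)
        overhead = (total - gates_single(M, K)) / gates_single(M, K)
        print(f"n={n:2d}  gates={total}  overhead={overhead:.1%}")
```

With M much larger than K the overhead stays at a few percent, which matches the slide's "minor cost" remark; the approximation breaks down when K·n becomes comparable to M.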
Is Pipelined Microarchitecture Easy to Implement?
▪ How would you properly slice the microarchitecture below?
▪ Is it just about slicing the design? → hazards, exceptions, branches…
[Figure: microarchitecture to be sliced]
Design complexity is a major challenge in pipelined microarchitecture!

Contents
• Stage Slicing
• Modifications for LW/SW and ALU Instructions
Stage Slicing (1)
[Figure: datapath sliced into IF-stage, ID-stage, EXE-stage, MEM-stage, and WB. Components: PC, “+4” adder, instruction memory, register file (rs, rt, rd; read ports r1, r2 and write port r3), 16-bit immediate sign-extended to 32 bits, “<<2” shifter with adder, ALU, data memory]
▪ Is it possible to slice further into more stages to boost the clock frequency?
▪ How to handle signals feeding back to previous stages? → discussed later

Stage Slicing (2)
▪ Assume: memory ops take 200 ns, register ops take 100 ns, and the ALU takes 400 ns
▪ Case 1: Slice the ALU into two uniform stages
[Figure: timeline, time axis 100 to 1300 ns]
Inst1  IF    ID    EXE1  EXE2  MEM   WB
Inst2        IF    ID    EXE1  EXE2  MEM   WB
Inst3              IF    ID    EXE1  EXE2  MEM
▪ Case 2: Slice the ALU into four stages → same throughput & higher latency
[Figure: timeline, time axis 100 to 1300 ns, truncated at the right edge]
Inst1  IF    ID    EXE1  EXE2  EXE3  EXE4  MEM
Inst2        IF    ID    EXE1  EXE2  EXE3  EXE4
Inst3              IF    ID    EXE1  EXE2  EXE3  …

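The comparison of the two cases above can be sketched numerically. The sketch assumes that IF and MEM count as memory ops (200 ns), that ID and WB count as register ops (100 ns), that stage-register overhead is ignored, and that the clock period is set by the slowest stage; the names are illustrative.

```python
from typing import Dict


def clock_period(stages: Dict[str, int]) -> int:
    """The clock must accommodate the slowest stage (latencies in ns)."""
    return max(stages.values())


def instruction_latency(stages: Dict[str, int]) -> int:
    """Every stage takes one clock period, so latency = stage count * period."""
    return len(stages) * clock_period(stages)


# Case 1: the 400 ns ALU is split into two 200 ns stages.
CASE1 = {"IF": 200, "ID": 100, "EXE1": 200, "EXE2": 200, "MEM": 200, "WB": 100}
# Case 2: the 400 ns ALU is split into four 100 ns stages.
CASE2 = {"IF": 200, "ID": 100, "EXE1": 100, "EXE2": 100,
         "EXE3": 100, "EXE4": 100, "MEM": 200, "WB": 100}

if __name__ == "__main__":
    for name, stages in (("case 1", CASE1), ("case 2", CASE2)):
        print(name, clock_period(stages), "ns period,",
              instruction_latency(stages), "ns latency")
```

Under these assumptions both cases are clocked at 200 ns because the memory stages remain the bottleneck, so finer ALU slicing only adds stages to each instruction's latency.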
Stage-Registers: Registers between Sliced Stages
▪ Placed between stages → pass “the whole state” to the next stage
▪ Still need some modifications for actual use!
[Figure: the sliced datapath with a stage register at each boundary between the IF-stage, ID-stage, EX-stage, MEM-stage, and WB-stage]

Placing Registers into Sliced Stages: LW
▪ How to properly write back data to rt?
addr = GPR[rs] + signExtend(imm)
GPR[rt] ← Mem[addr]
PC ← PC + 4
[Figure: pipelined datapath with InstA: LW in flight. The stage registers hold PCA+4, IRA, Reg[rsA], immA, addrA, and Mem[addr]]

Placing Registers into Sliced Stages: LW
▪ How to properly write back data to rt?
▪ Need to propagate the rt index throughout execution
addr = GPR[rs] + signExtend(imm)
GPR[rt] ← Mem[addr]
PC ← PC + 4
[Figure: same datapath with InstA: LW. In addition to PCA+4, IRA, Reg[rsA], immA, addrA, and Mem[addr], the index rtA is carried through three successive stage registers]

Placing Registers into Sliced Stages: R-type ALU
▪ How to properly write data back to rd?
dest_reg ← rd
ALUOut ← GPR[rs] + GPR[rt]
GPR[dest_reg] ← ALUOut
PC ← PC + 4
[Figure: pipelined datapath with InstB: ADD in flight. The stage registers hold PCB+4, IRB, Reg[rsB], Reg[rtB], and ALUOutB; the data memory is not used]

Placing Registers into Sliced Stages: R-type ALU
▪ How to properly write data back to rd?
▪ Propagate the rd index throughout the process
dest_reg ← rd
ALUOut ← GPR[rs] + GPR[rt]
GPR[dest_reg] ← ALUOut
PC ← PC + 4
[Figure: same datapath with InstB: ADD. In addition to PCB+4, IRB, Reg[rsB], Reg[rtB], and ALUOutB, the indices rtB and rdB are latched, and rdB is carried through the following stage registers; the data memory is not used]

Placing Registers into Sliced Stages: I-type ALU
▪ Can directly adapt the previous architecture
▪ Propagate the immediate field as an operand
dest_reg ← rt
ALUOut ← GPR[rs] + sign-extend(imm)
GPR[dest_reg] ← ALUOut
PC ← PC + 4
[Figure: pipelined datapath with InstC: ADDI in flight. The stage registers hold PCC+4, IRC, Reg[rsC], imm32C, rtC, rdC, and ALUOutC; the data memory is not used]

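Why the destination index has to travel with the instruction can be seen in a toy model. The following is a simplified sketch, not the datapath from the slides: it assumes a 5-stage pipeline without stalls, models each stage register as one slot holding the whole instruction, and uses made-up register numbers.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Inst:
    name: str
    dest: int  # destination index: rt for LW/ADDI, rd for ADD


STAGES = ["IF", "ID", "EXE", "MEM", "WB"]


def run(program: List[Inst]) -> List[Tuple[int, Optional[Inst], Inst]]:
    """Return (cycle, instruction in ID, instruction in WB) for every cycle
    in which the WB stage is occupied."""
    slots: List[Optional[Inst]] = [None] * len(STAGES)
    pending = list(program)
    log = []
    cycle = 0
    while pending or any(s is not None for s in slots):
        # Clock edge: every stage register passes its state to the next stage.
        slots = [pending.pop(0) if pending else None] + slots[:-1]
        if slots[4] is not None:
            log.append((cycle, slots[1], slots[4]))
        cycle += 1
    return log


if __name__ == "__main__":
    program = [Inst("LW", 8), Inst("ADD", 9), Inst("ADDI", 10), Inst("ADD", 11)]
    for cycle, in_id, in_wb in run(program):
        id_dest = in_id.dest if in_id else None
        print(f"cycle {cycle}: WB writes r{in_wb.dest} ({in_wb.name}); "
              f"ID-stage index is {id_dest}")
```

When LW reaches WB, the ID stage already holds a later instruction, so taking the write index from the ID-stage instruction fields would target the wrong register; the index stored alongside the data in the stage registers is the correct one.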
Putting All Together
▪ Is it the only design?
▪ Microarchitecture can be implementation-specific → as long as the ISA specification is followed
[Figure: complete pipelined datapath combining the modifications for LW/SW and ALU instructions]

Putting All Together: Another Option
▪ Place the multiplexer right in front of the register file
▪ Vs. previous design → more registers
[Figure: alternative pipelined datapath with the multiplexer placed in front of the register file]

Contents
• Sequencing Control Signals throughout Pipeline
• Further Issues to be Resolved
Don’t Forget Control Signals!
▪ Control signals must be asserted / de-asserted at the correct time
▪ Still does not work → how to feed control signals in a pipelined design?
[Figure: pipelined datapath annotated with the control signals PCSrc, RegWrite, RegDst, ALUSrc, ALUOp, MemRead, MemWrite, and MemReg]

Sequencing Control Signals (1)
▪ Control signals are needed in different cycles, each at its corresponding stage
▪ Option 1: Generate all control signals at the ID-stage and propagate them
• Needs duplicated registers to propagate control signals for different stages
• Easy to slice if signal generation takes several cycles → good scalability for a complex ISA
[Figure: control signal registers for different stages. The Control block decodes the instruction in ID into EX, MEM, and WB signal groups; the register after ID carries EX, MEM, WB, the register after EXE carries MEM, WB, and the register after MEM carries WB]

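Option 1 above can be sketched as a decode table plus three control registers that each drop the group already consumed. This is an illustrative sketch only: the grouping of signals into EX, MEM, and WB and the signal values follow the common textbook MIPS convention, which is an assumption here, and the helper names are hypothetical.

```python
from typing import Dict, Optional, Tuple

Bundle = Dict[str, Dict[str, int]]

# Assumed decode table (textbook-style MIPS values, not taken from the slides).
CONTROL_TABLE: Dict[str, Bundle] = {
    "LW":  {"EX": {"RegDst": 0, "ALUSrc": 1},
            "MEM": {"MemRead": 1, "MemWrite": 0},
            "WB": {"RegWrite": 1, "MemReg": 1}},
    "SW":  {"EX": {"RegDst": 0, "ALUSrc": 1},
            "MEM": {"MemRead": 0, "MemWrite": 1},
            "WB": {"RegWrite": 0, "MemReg": 0}},
    "ADD": {"EX": {"RegDst": 1, "ALUSrc": 0},
            "MEM": {"MemRead": 0, "MemWrite": 0},
            "WB": {"RegWrite": 1, "MemReg": 0}},
}

Regs = Tuple[Optional[Bundle], Optional[Bundle], Optional[Bundle]]


def decode(opcode: str) -> Bundle:
    """ID stage: generate every control signal for the instruction at once."""
    return {group: dict(signals) for group, signals in CONTROL_TABLE[opcode].items()}


def step(regs: Regs, decoded: Optional[Bundle]) -> Regs:
    """One clock edge: forward only the groups that later stages still need."""
    id_ex, ex_mem, _ = regs
    new_mem_wb = {"WB": ex_mem["WB"]} if ex_mem else None
    new_ex_mem = {"MEM": id_ex["MEM"], "WB": id_ex["WB"]} if id_ex else None
    return (decoded, new_ex_mem, new_mem_wb)


if __name__ == "__main__":
    regs: Regs = (None, None, None)
    for op in ("LW", "ADD", "SW"):
        regs = step(regs, decode(op))
    id_ex, ex_mem, mem_wb = regs
    print("EX stage (SW):  ", id_ex["EX"])
    print("MEM stage (ADD):", ex_mem["MEM"])
    print("WB stage (LW):  ", mem_wb["WB"])
```

Each register is narrower than the one before it, which is the "duplicated registers" cost named in the bullet: the WB signals are stored three times on their way down the pipeline.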
Sequencing Control Signals (2)
▪ Option 2: Propagate the instruction itself and generate control signals at each stage
▪ Easy to apply if decoding at each stage does not become the bottleneck of that stage
▪ Control logic complexity can be amortized across stages
[Figure: the instruction is passed along the IF, ID, EXE, MEM, WB stages, with a separate Control block per stage generating signals such as RegWrite, RegDst, ALUOp, ALUSrc, PCSrc, MemWr, and MemRead]
The choice is up to the designer, depending on the ISA and constraints!

Putting Control Sequencing into Pipeline
[Figure: pipelined datapath with control sequencing. The Control block feeds EX, MEM, and WB signal groups through the stage registers; signals shown: PCSrc, RegWr, RegDst, ALUSrc, ALUOp, Branch, MemRd, MemWr, MemReg. Callout: “Done? Sure?”]

Further Design Issues
▪ This pipelined microarchitecture is not yet ready to be deployed!
▪ Dependencies may exist between instructions → Data hazard
• In real programs, many instructions depend on each other
• E.g., ADD is a consumer of LW → but MEM & EX are overlapped!
▪ Branch needs special handling → Control hazard
• If a branch is taken, instructions at the ID and EXE stages must be flushed (or squashed)
• As if nothing happened → performance penalty
▪ Instruction memory and data memory may be unified → Structural hazard
• IF stage & MEM stage cannot be overlapped due to the structural limit
Hazards complicate processor pipeline design in practice!

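The data-hazard case above (a consumer reading a register that an earlier in-flight instruction has not yet written) can be detected with a simple scan. This is a conservative sketch: the `Inst` tuple, the register numbers, and the `window` parameter (how many preceding instructions are assumed to be still in flight) are all hypothetical and depend on the actual design.

```python
from typing import List, NamedTuple, Optional, Tuple


class Inst(NamedTuple):
    op: str
    dest: Optional[int]    # register written, or None
    srcs: Tuple[int, ...]  # registers read


def raw_dependencies(program: List[Inst], window: int) -> List[Tuple[int, int, int]]:
    """Return (producer index, consumer index, register) for every consumer
    that reads a register written by one of the `window` preceding
    instructions. Every in-window producer is reported, so the result may
    over-approximate the true hazards."""
    found = []
    for j, consumer in enumerate(program):
        for i in range(max(0, j - window), j):
            producer = program[i]
            if producer.dest is not None and producer.dest in consumer.srcs:
                found.append((i, j, producer.dest))
    return found


if __name__ == "__main__":
    program = [
        Inst("LW", dest=8, srcs=(4,)),     # r8 <- Mem[r4 + imm]
        Inst("ADD", dest=9, srcs=(8, 5)),  # r9 <- r8 + r5, consumer of LW
        Inst("ADDI", dest=10, srcs=(6,)),  # independent
    ]
    print(raw_dependencies(program, window=2))  # [(0, 1, 8)]
```

A real pipeline acts on such a dependency by stalling or otherwise resolving it; the scan only shows where the overlap of MEM and EX would read a stale value.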
