CS 211: Computer Architecture: Instructor: Prof. Bhagi Narahari
CS 211: Computer Architecture: Instructor: Prof. Bhagi Narahari
CS 211: Computer Architecture: Instructor: Prof. Bhagi Narahari
CS211 1
CS211 2
Appendix A of Textbook
Also parts of Chapter 2
CS211 3
CS211 4
Instruction Pipeline
Instruction execution process lends itself
naturally to pipelining
overlap the subtasks of instruction fetch, decode
and execute
CS211 5
Linear Pipeline
Processor
S1
S2
Sk
CS211 6
Synchronous Pipeline
All transfers simultaneous
One task or operation enters the pipeline per cycle
Processors reservation table : diagonal
CS211 7
S3
T1
S2
S1
T1
T2
T2
T3
T1
T2
T3
T4
Asynchronous Pipeline
Transfers performed when individual processors are ready
Handshaking protocol between processors
Mainly used in multiprocessor systems with message-passing
CS211 9
Si+1
= max { m } + d
Pipeline frequency : f
f=1/
CS211 10
T1
Tk
nk
= [ k + (n-1)]
nk
k + (n-1)
CS211 11
10
Sk
k
n
k + (n-1)
n
=
[ k + (n-1)]
nf
k + (n-1)
CS211 12
Task has 4 subtasks with time: t1=60, t2=50, t3=90, and t4=80 ns
(nanoseconds)
latch delay = 10
Pipeline cycle time = 90+10 = 100 ns
For non-pipelined execution
time = 60+50+90+80 = 280 ns
CS211 13
CS211 14
CS211 15
Reservation Tables
Reservation table : displays time-space flow of data
through the pipeline analogous to opcode of pipeline
Not diagonal, as in linear pipelines
CS211 16
14
CS211 17
Latency Analysis
Latency : the number of clock cycles
between two initiations of the pipeline
Collision : an attempt by two initiations to use
the same pipeline stage at the same time
Some latencies cause collision, some not
CS211 18
16
Collisions (Example)
x1
x2
x1
x3
x1
x4
x1x2
x2x3
x3x4
x4
x1x2
x1
x2x3
x1x2
x1x2x4
10
x2x3x4
Latency = 2
CS211 19
17
Latency Cycle
x1
9 10 11 12 13 14 15 16 17 18
x1 x2 x1
x1
x1
x1
x2 x3 x2
x2
x1
x1
x3
x2
x2
x3
x2
Cycle
x2
x3
x3
x3
Cycle
18
CS211 21
19
Collision Vector
Reservation Table
x
x
x
x
x1
x1
x1
x1
x1
Value X1
x2
x2
x2
x1
x2
x2
x2
Value X2
C = (? ? . . . ? ?)
CS211 22
CS211 23
Designing a Pipelined
Processor
CS211 24
Instr. Decode
Reg. Fetch
Execute
Addr. Calc
Next SEQ PC
Adder
Zero?
RS1
L
M
D
MUX
Data
Memory
ALU
Imm
MUX MUX
RD
Reg File
Inst
Memory
Address
RS2
Write
Back
MUX
Next PC
Memory
Access
Sign
Extend
WB Data
CS211 25
Instr. Decode
Reg. Fetch
Next SEQ PC
Next SEQ PC
Adder
Zero?
RS1
MUX
MEM/WB
Data
Memory
EX/MEM
ALU
MUX MUX
ID/EX
Imm
Reg File
IF/ID
Memory
Address
RS2
Write
Back
MUX
Next PC
Memory
Access
WB Data
Instruction
Fetch
Sign
Extend
RD
RD
RD
CS211 26
Graphically Representing
Pipelines
Visualizing Pipelining
Time (clock cycles)
Ifetch
DMem
Reg
Ifetch
Reg
Ifetch
Reg
DMem
Reg
Reg
DMem
ALU
Reg
ALU
O
r
d
e
r
Ifetch
ALU
I
n
s
t
r.
ALU
Reg
DMem
Reg
CS211 28
Exec
IFetch Dcd
Mem
WB
Exec
Mem
WB
Exec
Mem
WB
Exec
Mem
WB
Exec
Mem
WB
Exec
Mem
IFetch Dcd
IFetch Dcd
IFetch Dcd
Program Flow
IFetch Dcd
WB
CS211 29
Clk
Single Cycle,
Multiple
Cycle,
vs.
Cycle 1
Cycle 2
Pipeline
Store
Waste
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10
Clk
Multiple Cycle Implementation:
Load
Ifetch
Reg
Exec
Mem
Wr
Exec
Mem
Wr
Reg
Exec
Mem
Store
Ifetch
Reg
Exec
Mem
R-type
Ifetch
Pipeline Implementation:
Load Ifetch
Reg
Store Ifetch
R-type Ifetch
Reg
Exec
Wr
Mem
Wr
CS211 30
Reg/Dec
Exec
Cycle 4 Cycle 5
Mem
Wr
CS211 31
Reg/Dec
Exec
Wr
CS211 32
Clock
R-type Ifetch
R-type
Reg/Dec
Exec
Ifetch
Reg/Dec
Exec
Ifetch
Reg/Dec
Load
Wr
R-type Ifetch
Cycle 8 Cycle 9
Wr
Exec
Mem
Wr
Reg/Dec
Exec
Wr
R-type Ifetch
Reg/Dec
Exec
Wr
Important
Observation
Each
functional unit can only be used once per
instruction
Each functional unit must be used at the same stage
for all instructions:
Load uses Register Files Write Port during its 5th stage
Load
Ifetch
Reg/Dec
3
Exec
Mem
Wr
R-type uses Register Files Write Port during its 4th stage
1
R-type Ifetch
2
Reg/Dec
3
Exec
4
Wr
CS211 34
Cycle 9
Clock
Ifetch
Load
Reg/Dec
Exec
Ifetch
Reg/Dec
R-type Ifetch
Wr
Exec
Mem
Reg/Dec
Exec
Wr
Wr
Wr
Exec
Reg/Dec
Wr
Exec
R-type Ifetch
Cycle 1 Cycle 2
Reg/Dec
3
Exec
4
Mem
5
Wr
Clock
R-type Ifetch
R-type
Reg/Dec
Exec
Mem
Wr
Ifetch
Reg/Dec
Exec
Mem
Wr
Ifetch
Reg/Dec
Exec
Mem
Wr
Reg/Dec
Exec
Mem
Wr
Reg/Dec
Exec
Mem
Load
R-type Ifetch
R-type Ifetch
Wr
CS211 36
Why
Pipeline?
Multicycle Machine
10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
CS211 37
Inst 3
Im
Dm
Reg
Dm
Im
Reg
Im
Reg
Reg
Reg
Dm
ALU
Inst 4
Reg
Reg
ALU
Inst 2
Im
Dm
ALU
Inst 1
Reg
ALU
O
r
d
e
r
Inst 0
Im
ALU
I
n
s
t
r.
Reg
Dm
Reg
CS211 38
CS211 39
Cycle Timeunpipelined
Ideal CPI Pipeline depth
Speedup
Instr 4
Ifetch
Reg
Ifetch
DMem
ALU
Reg
Reg
Reg
Ifetch
Reg
DMem
Reg
DMem
Reg
ALU
Instr 3
Ifetch
DMem
ALU
O
r
d
e
r
Instr 2
Reg
ALU
I Load Ifetch
n
s
t Instr 1
r.
ALU
Reg
DMem
Reg
CS211 42
Stall
Instr 3
Reg
Ifetch
Reg
Bubble
Reg
DMem
ALU
Ifetch
DMem
Reg
DMem
Bubble Bubble
Ifetch
Reg
Reg
Bubble
ALU
O
r
d
e
r
Instr 2
Reg
ALU
I Load Ifetch
n
s
t Instr 1
r.
ALU
Bubble
DMem
Reg
CS211 43
Example
Machine A: Dual ported memory (Harvard
Architecture)
Machine B: Single ported memory, but its pipelined
implementation has a 1.05 times faster clock rate
Ideal CPI = 1 for both
Loads are 40% of instructions executed
SpeedUpA = Pipe. Depth/(1 + 0) x (clock unpipe/clockpipe)
= Pipeline Depth
SpeedUpB = Pipe. Depth/(1 + 0.4 x 1) x
(clockunpipe/(clockunpipe /
1.05)
= (Pipe. Depth/1.4) x 1.05
= 0.75 x Pipe. Depth
SpeedUpA / SpeedUpB = Pipe. Depth/(0.75 x Pipe. Depth) = 1.33
Data Dependencies
True dependencies and False dependencies
false implies we can remove the dependency
true implies we are stuck with it!
CS211 46
CS211 47
RAW Dependency
Example program (a) with two instructions
i1: load r1, a;
i2: add r2, r1,r1;
CS211 50
CS211 51
CS211 52
CS211 53
Hazard Detection in
Hardware
A RAW hazard exists on register if Rregs( i ) Wregs( j )
Keep a record of pending writes (for inst's in the pipe) and
compare with operand regs of current instruction.
When instruction issues, reserve its result register.
When on operation completes, remove its write reservation.
CS211 55
CS211 56
Data Hazard on R1
Time (clock cycles)
and r6,r1,r7
or
r8,r1,r9
xor r10,r1,r11
Ifetch
DMem
Reg
DMem
Ifetch
Reg
DMem
Reg
DMem
Reg
ALU
sub r4,r1,r3
Reg
ALU
Ifetch
ALU
O
r
d
e
r
add r1,r2,r3
WB
ALU
I
n
s
t
r.
MEM
ALU
IF ID/RF EX
Ifetch
Reg
Ifetch
Reg
Reg
Reg
DMem
CS211 57
Reg
or
r8,r1,r9
xor r10,r1,r11
Reg
DMem
Ifetch
Reg
DMem
Reg
DMem
Reg
ALU
and r6,r1,r7
Ifetch
DMem
ALU
sub r4,r1,r3
Reg
ALU
O
r
d
e
r
ALU
I
n
s
t
r.
ALU
Ifetch
Reg
Ifetch
Reg
Reg
Reg
DMem
Reg
CS211 58
38
Internal Data
Forwarding
Store-load forwarding
Memory
M
Memory
M
Access Unit
Access Unit
R1
STO M,R1
R2
LD R2,M
R1
STO M,R1
R2
MOVE R2,R1
CS211 60
39
Internal Data
Forwarding
Load-load forwarding
Memory
M
Memory
M
Access Unit
Access Unit
R1
LD R1,M
R2
LD R2,M
R1
LD R1,M
R2
MOVE R2,R1
CS211 61
40
Internal Data
Forwarding
Store-store forwarding
Memory
M
Memory
M
Access Unit
Access Unit
R1
STO M, R1
R2
STO M,R2
R1
R2
STO M,R2
CS211 62
NextPC
mux
MEM/WR
EX/MEM
ALU
mux
ID/EX
Registers
mux
Immediate
Data
Memory
CS211 63
op Rd Ra Rb
op Rd Ra Rb
Rd
Rd
Tricky situation:
R1 <- Mem[ R2 + I ]
Mem[R3+34] <- R1
Handle with bypass in memory stage!
Mem
T
to reg
file
CS211 64
and r6,r1,r7
or
r8,r1,r9
DMem
Ifetch
Reg
DMem
Reg
Ifetch
Ifetch
Reg
Reg
Reg
DMem
ALU
sub r4,r1,r6
Reg
ALU
O
r
d
e
r
ALU
I
n
s
t
r.
ALU
Reg
DMem
Reg
CS211 65
and r6,r1,r7
or r8,r1,r9
Reg
DMem
Ifetch
Reg
Bubble
Ifetch
Bubble
Reg
Bubble
Ifetch
Reg
DMem
Reg
Reg
DMem
ALU
sub r4,r1,r6
Ifetch
ALU
O
r
d
e
r
lw r1, 0(r2)
ALU
I
n
s
t
r.
ALU
Reg
DMem
CS211 66
CS211 67
Rb,b
Rc,c
Ra,Rb,Rc
a,Ra
Re,e
Rf,f
Rd,Re,Rf
d,Rd
Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd
CS211 68
Reg
DMem
Ifetch
Reg
Ifetch
Reg
Reg
DMem
Reg
DMem
Ifetch
Reg
ALU
r6,r1,r7
Ifetch
DMem
ALU
18: or
Reg
ALU
Ifetch
ALU
ALU
Reg
Reg
DMem
CS211 70
Reg
Instr. Decode
Reg. Fetch
Execute
Addr. Calc.
Memory
Access
Write
Back
CS211 72
CS211 73
CS211 74
CS211 75
Delayed Branch
Where to get instructions to fill branch delay slot?
CS211 76
Pipelinedepth
1+BranchfrequencyBranchpenalty
Scheduling
BranchCPIspeedup v.speedup v. scheme
penalty
unpipelined
stall
Stall pipeline
31.423.5
1.0
Predict taken
11.144.4
1.26
Predict not taken
11.094.5
1.29
Delayed branch
0.51.074.6
1.31
CS211 77
Summary :
Control and Pipelining
Just overlap tasks; easy if tasks are independent
Speed Up Pipeline Depth; if ideal CPI is 1, then:
Cycle Timeunpipelined
Pipeline depth
Speedup
CS211 79
Summary #1/2:
Pipelining
What makes it easy
all instructions are the same length
just a few instruction formats
memory operands appear only in loads and stores
Introduction to ILP
What is ILP?
Processor and Compiler design techniques that
speed up execution by causing individual machine
operations to execute in parallel
Hardware
FrontendandOptimizer
Superscalar
DetermineDependences
Dataflow
DetermineDependences
DetermineIndependences
Indep.Arch.
DetermineIndependences
BindOperationstoFunctionUnits
VLIW
BindOperationstoFunctionUnits
BindTransportstoBusses
BindTransportstoBusses
TTA
Execute
B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel: History overview, and perspective. The
Journal of Supercomputing, 7(1-2):9-50, May 1993.
CS211 82