ILP-Architectures Part I
ILP-Architectures Part I
ILP-Architectures Part I
5SAI0
Needed:
• Sufficient (HW) resources
• Parallel scheduling
– Hardware solution
– Software solution
• Application should contain sufficient ILP
(1-issue)
RISC CPU
07/20/2021 ECA H.Corporaal 4
Recap (from pipelining lectures): Hazards
• Three types of hazards (see previous lecture)
– Structural
• Multiple instructions need access to the same hardware at the same time
– Data dependence
• There is a dependence between operands (in register or memory) of successive
instructions
– Control dependence
• Determines the order of the execution of basic blocks
• When jumping/branching to another address the pipeline has to be (partly)
squashed and refilled
• Hazards cause scheduling problems and delay the pipeline
4
mul y,a,b
…………..
Questions:
• How real are control dependences?
• Can ‘mul y,a,b’ be moved to block 2, 3 or even block 1?
• Can ‘rem r, a, b’ be moved to block 1 and executed speculatively?
07/20/2021 ECA H.Corporaal 8
Avoiding pipeline stalls due to Hazards
• Structural
– Buy more hardware
• Extra units, pipelined units, more ports on RF and data memory (or banked memories), etc.
– Note: more HW means bigger chip => could increase cycle time tcycle
• Data dependence
– Real (RaW) dependences: add Forwarding (aka Bypassing) logic
• Compiler optimizations
– False (WaR & WaW) dependences: use renaming (either in HW or in SW)
• Control dependence
– Branch prediction
– Avoiding Branches
– Adding pipeline HW to reduce the number of Branch delay slots
07/20/2021 ECA H.Corporaal 9
Topics on ILP architectures
• Introduction, Hazards (short recap)
• Out-Of-Order (OoO) execution:
– Dependences limit ILP: dynamic scheduling
– Hardware speculation
• Branch prediction
• Multiple issue
• How much ILP is there?
Decoder
Reservation
Stations
Address
Data
Data Cache Data
Reorder
Register
Buffer
File Data
Memory
07/20/2021 ECA H.Corporaal 14
Example of Superscalar Execution
• Superscalar processor organization, example:
– simple pipeline: IF, EX, WB
– fetches/issues upto 2 instructions each cycle (= 2-issue)
– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier
– Instruction window (buffer between IF and EX stage) is of size 2
– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc
Cycle 1 2 3 4 5 6 7
L.D F6,32(R2)
L.D F2,48(R3)
MUL.D F0,F2,F4
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2
MUL.D F12,F2,F4
Cycle 1 2 3 4 5 6 7
L.D F6,32(R2) IF
L.D F2,48(R3) IF
MUL.D F0,F2,F4
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2
MUL.D F12,F2,F4
Cycle 1 2 3 4 5 6 7
L.D F6,32(R2) IF EX
L.D F2,48(R3) IF EX
MUL.D F0,F2,F4 IF
SUB.D F8,F2,F6 IF
DIV.D F10,F0,F6
ADD.D F6,F8,F2
MUL.D F12,F2,F4
Cycle 1 2 3 4 5 6 7
L.D F6,32(R2) IF EX WB
L.D F2,48(R3) IF EX WB
MUL.D F0,F2,F4 IF EX
SUB.D F8,F2,F6 IF EX
DIV.D F10,F0,F6 IF
ADD.D F6,F8,F2 IF
MUL.D F12,F2,F4
Cycle 1 2 3 4 5 6 7
L.D F6,32(R2) IF EX WB
L.D F2,48(R3) IF EX WB
MUL.D F0,F2,F4 IF EX EX
SUB.D F8,F2,F6 IF EX EX
DIV.D F10,F0,F6 IF
stall because
ADD.D F6,F8,F2 IF of data dep.
MUL.D F12,F2,F4
cannot be fetched because window full
Cycle 1 2 3 4 5 6 7
L.D F6,32(R2) IF EX WB
L.D F2,48(R3) IF EX WB
MUL.D F0,F2,F4 IF EX EX EX
SUB.D F8,F2,F6 IF EX EX WB
DIV.D F10,F0,F6 IF
ADD.D F6,F8,F2 IF EX
MUL.D F12,F2,F4 IF
Cycle 1 2 3 4 5 6 7
L.D F6,32(R2) IF EX WB
L.D F2,48(R3) IF EX WB
MUL.D F0,F2,F4 IF EX EX EX EX
SUB.D F8,F2,F6 IF EX EX WB
DIV.D F10,F0,F6 IF
ADD.D F6,F8,F2 IF EX EX
MUL.D F12,F2,F4 IF
cannot execute
structural hazard
07/20/2021 ECA H.Corporaal 21
Example of Superscalar Processor Execution
• Superscalar processor organization:
– simple pipeline: IF, EX, WB
– fetches 2 instructions each cycle
– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier
– Instruction window (buffer between IF and EX stage) is of size 2
– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc
Cycle 1 2 3 4 5 6 7
L.D F6,32(R2) IF EX WB
L.D F2,48(R3) IF EX WB
MUL.D F0,F2,F4 IF EX EX EX EX WB
SUB.D F8,F2,F6 IF EX EX WB
DIV.D F10,F0,F6 IF EX
ADD.D F6,F8,F2 IF EX EX WB
MUL.D F12,F2,F4 IF ?
Decoder
Reservation
Stations
Address
Data
Data Cache Data
Reorder
Register
Buffer
File Data
Memory
07/20/2021 ECA H.Corporaal 23
Superscalar Issues
1. How to fetch multiple instructions in time?
(especially across basic block boundaries)
2. Predicting branches
3. Non-blocking memory system
4. Tune #resources(FUs, ports, entries, etc.)
5. Handling dependencies
6. How to support precise interrupts?
7. How to recover from a mis-predicted branch path?
• For the latter two issues you may have a look at speculative and architectural state ,
– Processor Microarchitecture by Antonio González e.a., Morgan&Claypool 2011
– Superscalar Microprocessor Design, Mike Johnson, 1991 (or thesis 1989)
• Note: Load and Store buffers contain data and addresses, act like reservation stations
07/20/2021 ECA H.Corporaal 26
Tomasulo’s Algorithm
3 basic steps:
• Issue
– Get next instruction from FIFO queue
– If available RS (reservation station), issue instruction to the RS with operand values if
available
– If operand values not available, stall the instruction
• Execute
– When operand becomes available, store it in all reservation stations waiting for it
– When all operands are ready, issue the instruction
– Loads and stores are maintained in program order
– No instruction allowed to initiate execution until all branches that proceed it in program
order have completed
• Write result
– Write result on CDB into reservation stations and store buffers
• (Stores must wait until address and value are received)
07/20/2021 ECA H.Corporaal 27
Register Renaming Technique
• Eliminate name/false dependencies:
– anti- (WaR) and output (WaW)) dependencies
• Can be implemented
– by the compiler
• advantage: low cost
• disadvantage: “old” codes perform poorly
– in hardware
• advantage: binary compatibility
• disadvantage: extra hardware needed
NEW
in operation
• Exceptions:
– Not recognized/taken until it is ready to commit
• But...
– there is an alternative.....
r2 R5 r2 R5
r3 R1 r3 R2
r4 R9 r4 R9
• what is it?
• why do we need it?
• how does it work?
• various schemes
Questions:
• do I jump ? -> branch prediction
• where do I jump ? -> branch target prediction
BHT
0
1
k-bits 0
1 size=2k
0
1
1
0
– Solution: Use tags (the buffer becomes a tag); however very expensive
• Loops are predicted wrong twice
– Solution: Use n-bit saturation counter prediction
* taken if counter 2 (n-1)
* not-taken if counter < 2 (n-1)
– A 2-bit saturating counter predicts a loop wrong only once
07/20/2021 ECA H.Corporaal 48
2-bit Branch Prediction Buffer
• Solution: 2-bit scheme where prediction is changed only if mispredicted
twice
• Can be implemented as a saturating counter, e.g. as following state diagram:
T
NT
Predict Taken Predict Taken
T
T NT
NT
Predict Not Predict Not
Taken T Taken
NT
07/20/2021 ECA H.Corporaal 49
Next step: Correlating Branches
• Fragment from SPEC92 benchmark eqntott:
Prediction
Prediction
• (2,2) predictor: 2-bit global, 2-bit local
n-bit
m saturating
1 Up/Down
Counter Prediction
Branch Address 0
0 1 2k-1
a k
98
PA (10, 6, 4, 2)
Branch Prediction Accuracy (%)
97
96
95 Bimodal
94
GAs
93
PAs
92
91
89