ILP-Architectures Part I

Embedded Computer Architecture
5SAI0
Instruction Level Parallel

Architectures
Part I
Henk Corporaal
www.ics.ele.tue.nl/~heco/courses/ECA
TUEindhoven
2020-2021
Topics on ILP architectures
• Introduction, Hazards (short recap)
• Out-Of-Order (OoO) execution (H&P sect 3.4-3.6):
– Dependences limit ILP: dynamic scheduling
– Hardware speculation
• Branch prediction (H&P sect 3.6 + 3.9)
• Multiple issue (H&P sect 3.7-3.8)
• How much ILP is there?
• Material Ch 3 (H&P or Dubois, second part)
07/20/2021 ECA H.Corporaal 2

What is ILP = Instruction level parallelism
• multiple operations (or instructions) can be executed in parallel, from a
single instruction stream
– so we are not yet talking about MIMD, multiple instruction streams
Needed:
• Sufficient (HW) resources
• Parallel scheduling
– Hardware solution
– Software solution
• Application should contain sufficient ILP

Single Issue RISC vs Superscalar
instr op instr op
instr op instr op
instr op instr op
instr op instr op
instr op instr op
instr op Change HW, instr op
instr op but can use instr op
instr op same code instr op
instr op instr op
instr op instr op
instr op instr op
instr op instr op
execute issue and (try to) execute
1 instr/cycle 3-issue Superscalar 3 instr/cycle
(1-issue)
RISC CPU
Recap (from pipelining lectures): Hazards
• Three types of hazards (see previous lecture)
– Structural
• Multiple instructions need access to the same hardware at the same time
– Data dependence
• There is a dependence between operands (in register or memory) of successive
instructions
– Control dependence
• Determines the order of the execution of basic blocks
• When jumping/branching to another address the pipeline has to be (partly)
squashed and refilled
• Hazards cause scheduling problems and delay the pipeline

Impact of Hazards
• Hazards cause pipeline 'bubbles'
Increase of CPI (and therefore execution time)
• Texec = Ninstr * CPI * Tcycle

where
 CPI = CPIbase + Σi <CPIhazard_i>
 <CPIhazard> = fhazard * <Cycle_penaltyhazard>
 fhazard = fraction [0..1] of occurrence of this hazard
Data dependences
(see earlier lectures for details & examples)
• RaW read after write

– real or flow dependence
– can only be avoided by value prediction (i.e. speculating on the outcome of a
previous operation)
• WaR write after read
• WaW write after write
– WaR and WaW are false or name dependencies
– Could be avoided by renaming (if sufficient registers are available); see later slide
Notes:
1. data dependences can be both between registers and memory data operations
2. data dependencies are shown in de DDG: Data Dependence Graph

Control Dependences: CFG
C input code: 1
sub t1, a, b
CFG
if (a > b) { r = a % b; } bgz t1, 2, 3 (Control Flow Graph):
else { r = b % a; }
y = a*b;
2 3
rem r, a, b rem r, b, a
goto 4 goto 4
4
mul y,a,b
…………..
Questions:
• How real are control dependences?
• Can ‘mul y,a,b’ be moved to block 2, 3 or even block 1?
• Can ‘rem r, a, b’ be moved to block 1 and executed speculatively?
Avoiding pipeline stalls due to Hazards
• Structural
– Buy more hardware
• Extra units, pipelined units, more ports on RF and data memory (or banked memories), etc.
– Note: more HW means bigger chip => could increase cycle time tcycle
• Data dependence
– Real (RaW) dependences: add Forwarding (aka Bypassing) logic
• Compiler optimizations
– False (WaR & WaW) dependences: use renaming (either in HW or in SW)
• Control dependence
– Branch prediction
– Avoiding Branches
– Adding pipeline HW to reduce the number of Branch delay slots
• Out-Of-Order (OoO) execution:
– Dependences limit ILP: dynamic scheduling
– Hardware speculation
• Branch prediction
• Multiple issue

Let's look at Out-of-Order execution:

Dynamic Scheduling Principle
• What we examined so far is static scheduling
– Compiler reorders instructions so as to avoid hazards and reduce stalls
• Dynamic scheduling:
hardware rearranges instruction execution to reduce stalls
• Example:
DIV.D F0,F2,F4 ; takes 24 cycles and
; is not pipelined
ADD.D
RaW; real dependence
F10,F0,F8
This instruction cannot continue

SUB.D F12,F8,F14 even though it does not depend
on previous Div and Add
• Key idea: Allow instructions behind stall to proceed

• Book describes Tomasulo algorithm in detail, but we first describe general idea

Advantages of Dynamic Scheduling
(compared to static scheduling, by compiler)
• Handles cases when dependences are unknown at compile time
– e.g., because they may involve a memory reference
• Simplifies the compiler (although a good compiler helps)
• Binary compatible: Allows code compiled for one machine to run

efficiently on a different machine, with different number of function units
(FUs), and different pipelining
• Allows hardware speculation, a technique with significant performance

advantages, that builds on dynamic scheduling
– Executing instructions before control dependend branch is handled
– I.e., we speculate on the branch outcome

Superscalar: General Architecture Concept
Instruction Instruction
Instruction
Memory Cache
Decoder
Reservation
Stations
Branch Logic & Load Store

ALU-1 ALU-2
Unit Shift Unit Unit
Address
Data
Data Cache Data
Reorder
Register
Buffer
File Data
Memory
Example of Superscalar Execution
• Superscalar processor organization, example:
– simple pipeline: IF, EX, WB
– fetches/issues upto 2 instructions each cycle (= 2-issue)
– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier
– Instruction window (buffer between IF and EX stage) is of size 2
– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc
Cycle 1 2 3 4 5 6 7
L.D F6,32(R2)
L.D F2,48(R3)
MUL.D F0,F2,F4
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2
MUL.D F12,F2,F4

Example of Superscalar Processor Execution
• Superscalar processor organization:
– fetches 2 instructions each cycle
Cycle 1 2 3 4 5 6 7
L.D F6,32(R2) IF
L.D F2,48(R3) IF
MUL.D F0,F2,F4
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2
MUL.D F12,F2,F4

Cycle 1 2 3 4 5 6 7
L.D F6,32(R2) IF EX
L.D F2,48(R3) IF EX
MUL.D F0,F2,F4 IF
SUB.D F8,F2,F6 IF
DIV.D F10,F0,F6
ADD.D F6,F8,F2
MUL.D F12,F2,F4

Cycle 1 2 3 4 5 6 7
L.D F6,32(R2) IF EX WB
MUL.D F0,F2,F4 IF EX
SUB.D F8,F2,F6 IF EX
DIV.D F10,F0,F6 IF
ADD.D F6,F8,F2 IF
MUL.D F12,F2,F4

Cycle 1 2 3 4 5 6 7
MUL.D F0,F2,F4 IF EX EX
SUB.D F8,F2,F6 IF EX EX
DIV.D F10,F0,F6 IF
stall because
ADD.D F6,F8,F2 IF of data dep.
MUL.D F12,F2,F4
cannot be fetched because window full

Cycle 1 2 3 4 5 6 7
MUL.D F0,F2,F4 IF EX EX EX
SUB.D F8,F2,F6 IF EX EX WB
DIV.D F10,F0,F6 IF
ADD.D F6,F8,F2 IF EX
MUL.D F12,F2,F4 IF

Cycle 1 2 3 4 5 6 7
MUL.D F0,F2,F4 IF EX EX EX EX
DIV.D F10,F0,F6 IF
ADD.D F6,F8,F2 IF EX EX
MUL.D F12,F2,F4 IF
cannot execute
structural hazard
Cycle 1 2 3 4 5 6 7
MUL.D F0,F2,F4 IF EX EX EX EX WB
DIV.D F10,F0,F6 IF EX
ADD.D F6,F8,F2 IF EX EX WB
MUL.D F12,F2,F4 IF ?

Superscalar: General Architectur Concept
Instruction Instruction
Instruction
Memory Cache
Decoder
Reservation
Stations
Branch Logic & Load Store

ALU-1 ALU-2
Unit Shift Unit Unit
Address
Data
Data Cache Data
Reorder
Register
Buffer
File Data
Memory
Superscalar Issues
1. How to fetch multiple instructions in time?
(especially across basic block boundaries)
2. Predicting branches
3. Non-blocking memory system
4. Tune #resources(FUs, ports, entries, etc.)
5. Handling dependencies
6. How to support precise interrupts?
7. How to recover from a mis-predicted branch path?
• For the latter two issues you may have a look at speculative and architectural state ,
– Processor Microarchitecture by Antonio González e.a., Morgan&Claypool 2011
– Superscalar Microprocessor Design, Mike Johnson, 1991 (or thesis 1989)

Beginning of dynamic scheduling
• 1967 @ IBM
– Robert Tomasulo
– Dynamic scheduling of Floating Point operations on multiple execution
units
– Allowing Out-Of-Order execution
– Register renaming in HW
• Only a few architectural visible Floating Point registers were available
• Renaming extends this number (although not architectural visible)
– Reservation stations in front of Execution Units
– CDB: Common Data Bus for all result broadcasts
• see https://en.wikipedia.org/wiki/Tomasulo_algorithm
Tomasulo’s Algorithm
• Top-level design:
• Note: Load and Store buffers contain data and addresses, act like reservation stations
Tomasulo’s Algorithm
3 basic steps:
• Issue
– Get next instruction from FIFO queue
– If available RS (reservation station), issue instruction to the RS with operand values if
available
– If operand values not available, stall the instruction
• Execute
– When operand becomes available, store it in all reservation stations waiting for it
– When all operands are ready, issue the instruction
– Loads and stores are maintained in program order
– No instruction allowed to initiate execution until all branches that proceed it in program
order have completed
• Write result
– Write result on CDB into reservation stations and store buffers
• (Stores must wait until address and value are received)
Register Renaming Technique
• Eliminate name/false dependencies:
– anti- (WaR) and output (WaW)) dependencies
• Can be implemented
– by the compiler
• advantage: low cost
• disadvantage: “old” codes perform poorly
– in hardware
• advantage: binary compatibility
• disadvantage: extra hardware needed

Register Renaming: general idea
• Example, look at F6: Question: how can this code
(optimally) be executed?
DIV.D F0, F2, F4
ADD.D F6, F0, F8
S.D F6, 0(R1) F6: RaW
SUB.D F8, F10, F14 F6: WaR F6: WaW
MUL.D F6, F10, F8
• False (aka “name”) dependences with F6

– anti: WaR and
– output: WaW in this example

Original code:
Register Renaming DIV.D F0, F2, F4
ADD.D F6, F0, F8
• Same example after renaming: S.D F6, 0(R1)
SUB.D F8, F10, F14
DIV.D R, F2, F4 MUL.D F6, F10, F8
ADD.D S, R, F8
S.D S, 0(R1)
SUB.D T, F10, F14
MUL.D U, F10, T
• Each destination gets a new (physical) register assigned

• Now only RaW hazards remain, which can be strictly ordered
• We will see several implementations of Register Renaming
1. use ReOrder Buffer (ROB) & Reservation Stations, or
2. use large register file with mapping table

Register Renaming with Tomasulo
• Register renaming can be provided by reservation stations (RS)
– Each RS entry contains 3 fields:
• Instruction
• Buffered operand values (when available)
• Reservation station number of instruction providing the operand values
• Operation of RS:
– RS fetches and buffers an operand as soon as it becomes available
• not necessarily involving register file
– Pending instructions designate the RS to which they will send their output
• Result values broadcast on a result bus, called the common data bus (CDB)
– Only the last output updates the register file
– As instructions are issued, the register-IDs are renamed with the reservation station
– We can have many more reservation stations than (architectural) registers
Example: snapshot, first L.D just finished

Speculation (Hardware based)
• Execute instructions along predicted execution paths but only commit the
results if the prediction was correct
• Instruction commit: allowing an instruction to update the register file when

instruction is no longer speculative
• Need an additional piece of hardware to prevent any irrevocable action until

an instruction commits:
– Reorder buffer, or Large renaming register file
– why? think about it?

OoO execution with speculation using RoB
NEW

Reorder Buffer (RoB)
• RoB – holds the result of instruction between completion and
commit
– Four fields per entry:
1. Instruction type: branch/store/register
2. Destination field: register number
3. Value field: output value
4. Ready field: completed execution? (is the data valid)
• Modify reservation stations:

– Operand source-id is now RoB (if its producer is still in the RoB) instead
of functional unit (check yourself!)

Reorder buffer
in operation

Reorder Buffer (RoB)
• Register values and memory values are not written until an
instruction commits
• RoB effectively renames the destination registers

– every destination gets a new entry in the RoB
• On misprediction:
– Speculated entries in RoB are cleared
• Exceptions:
– Not recognized/taken until it is ready to commit

Register renaming
• The RoB together with the Reservation station effectively
implement register renaming => avoiding WaW and WaR Hazards
when reordering code
• But...
– there is an alternative.....

Register Renaming using RAT
• there’s a physical register file larger than logical register file
• mapping table associates logical registers with physical register
– called RAT = Register Alias Table
• when an instruction is decoded:
– its physical source registers are obtained from mapping table
– its physical destination register is obtained from a free list
– mapping table is updated before:addi r3,r3,4 after: addi R2,R1,4
current new
r0 R8 r0 R8
mapping table: mapping table:
r1 R7 r1 R7
r2 R5 r2 R5
r3 R1 r3 R2
r4 R9 r4 R9
current free list: R2 R6 new free list: R6

Register Renaming using mapping table
Another example:
• Before (assume r0->R8, r1->R6, r2->R5, .. ):
• addi r1, r2, 1
• addi r2, r0, 0 // WaR
• addi r1, r2, 1 // WaW + RaW
• After (free list: R7, R9, R10)

• addi R7, R5, 1
• addi R10, R8, 0 // WaR disappeared
• addi R9, R10, 1 // WaW disappeared,
// RaW renamed to R10

Summary O-o-O architectures
• Instructions issue in-order, execute out-of-order,
and commit in-order !!
• Renaming avoids naming dependences
– via RoB: allocating an entry for every result, or
– via Register Mapping: Architectural registers are mapped on (many
more) Physical registers
• Speculation beyond branches
– branch prediction required (see next slides)
– recovery needed in case of miss prediction
• in-order-completion
• also needed for precise exception support
• Out-Of-Order (OoO) execution:
• Branch prediction
– 1&2 bit prediction
– Branch correlation
– Avoiding branches: if-conversion
• Multiple issue

Branch Prediction
• what is it?
• why do we need it?
• how does it work?
• various schemes

Branch Prediction Principles
breq r1, r2, label // if r1==r2
// then PCnext = label
// else PCnext = PC + 4 (for a RISC)
Questions:
• do I jump ? -> branch prediction
• where do I jump ? -> branch target prediction
• what's the average branch penalty?

– <CPIbranch_penalty>
– i.e. how many instruction slots do I miss (or squash) due to branches

Branch Prediction & Speculation
• High branch penalties in pipelined processors:
– With about 20% of the instructions being a branch, the maximum ILP is
five (but actually much less!)
• CPI = CPIbase + fbranch * fmisspredict * cycles_penalty
– Large impact if:
• Penalty high: long pipeline
• CPIbase low: for multiple-issue processors,
• Idea: predict the outcome of branches based on their history and
execute instructions at the predicted branch target speculatively

Branch Prediction Schemes
Predict branch direction:
• 1-bit Branch Prediction Buffer
• 2-bit Branch Prediction Buffer
• Correlating Branch Prediction Buffer
Predicting next address:

• Branch Target Buffer
• Return Address Predictors
+ Or: try to get rid (some) of those malicious branches

1-bit Branch Prediction Buffer
• 1-bit branch prediction buffer or branch history table (BHT):
PC 10…..10 101 00
BHT
0
1
k-bits 0
1 size=2k
0
1
1
0
• Buffer is like a cache without tags

• Does not help for our MIPS pipeline because target address calculations performed in
same stage as branch condition calculation

PC 10…..10 101 00
Two 1-bit predictor problems
BHT
0
1
k-bits 0
1 size=2k
0
1
1
• Aliasing: lower k bits of different branch instructions could be the same 0
– Solution: Use tags (the buffer becomes a tag); however very expensive
• Loops are predicted wrong twice
– Solution: Use n-bit saturation counter prediction
* taken if counter  2 (n-1)
* not-taken if counter < 2 (n-1)
– A 2-bit saturating counter predicts a loop wrong only once
2-bit Branch Prediction Buffer
• Solution: 2-bit scheme where prediction is changed only if mispredicted
twice
• Can be implemented as a saturating counter, e.g. as following state diagram:
T
NT
Predict Taken Predict Taken
T
T NT
NT
Predict Not Predict Not
Taken T Taken
NT
Next step: Correlating Branches
• Fragment from SPEC92 benchmark eqntott:
if (aa==2) subi R3,R1,#2

aa = 0; b1: bnez R3,L1
if (bb==2) add R1,R0,R0
bb=0; L1: subi R3,R2,#2
if (aa!=bb){..} b2: bnez R3,L2
add R2,R0,R0
L2: sub R3,R1,R2
b3: beqz R3,L3

Correlating Branch Predictor 4 bits from branch address
Idea: behavior of current branch is related to
(taken/not taken) history of recently 2-bits per branch
executed branches local predictors
– Then behavior of recent branches selects
between, say, 4 predictions of next branch,
updating just that prediction
Prediction
Prediction
• (2,2) predictor: 2-bit global, 2-bit local
• (k,n) predictor uses behavior of last k

branches to choose from 2k predictors, each
of which is n-bit predictor
shift register,
remembers
last 2 branches
2-bit global
branch history register
(01 = not taken, then taken)
Branch Correlation: the General Scheme
• 4 parameters: (a, k, m, n)
Pattern History Table
2m-1
n-bit
m saturating
1 Up/Down
Counter Prediction
Branch Address 0
0 1 2k-1
a k
Branch History Table

(BHT)
Table size (usually n = 2): Nbits = k * 2a + 2k * 2m *n

• mostly n = 2
Two varieties
1. GA: Global history, a = 0
• only one (global) history register  correlation is with previously executed
branches (often different branches)
• Variant: Gshare (Scott McFarling’93): GA which takes logic OR of PC address bits
and branch history bits
2. PA: Per address history, a > 0

• if a large almost each branch has a separate history
• so we correlate with same branch

Accuracy, taking the best combination of parameters
(a, k, m, n) :
GA (0,11,5,2)
98
PA (10, 6, 4, 2)
Branch Prediction Accuracy (%)
97
96
95 Bimodal
94
GAs
93
PAs
92
91
89
64 128 256 1K 2K 4K 8K 16K 32K 64K

Predictor Size (bytes)
Branch Prediction; summary
• Basic 2-bit predictor:
– For each branch:
• Predict taken or not taken
• If the prediction is wrong two consecutive times, change prediction
• Correlating (global history) predictor:
– Multiple 2-bit predictors for each branch
– One for each possible combination of outcomes of preceding n branches
• Local predictor:
– Multiple 2-bit predictors for each branch
– One for each possible combination of outcomes for the last n occurrences of this
branch
• Tournament predictor:
– Combine correlating global predictor with local predictor

Next lectures
• Wednesday: GPUs
• Monday: ILP part II

ILP-Architectures Part I

Uploaded by

Copyright:

Available Formats

ILP-Architectures Part I

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

ILP-Architectures Part I

Uploaded by

Copyright:

Available Formats

Embedded Computer Architecture

Instruction Level Parallel

• Material Ch 3 (H&P or Dubois, second part)

07/20/2021 ECA H.Corporaal 2

07/20/2021 ECA H.Corporaal 3

07/20/2021 ECA H.Corporaal 5

• Texec = Ninstr * CPI * Tcycle

• RaW read after write

07/20/2021 ECA H.Corporaal 7

• Material Ch 3 (H&P or Dubois, second part)

07/20/2021 ECA H.Corporaal 10

07/20/2021 ECA H.Corporaal 11

This instruction cannot continue

• Key idea: Allow instructions behind stall to proceed

07/20/2021 ECA H.Corporaal 12

• Binary compatible: Allows code compiled for one machine to run

• Allows hardware speculation, a technique with significant performance

07/20/2021 ECA H.Corporaal 13

Branch Logic & Load Store

07/20/2021 ECA H.Corporaal 15

07/20/2021 ECA H.Corporaal 16

07/20/2021 ECA H.Corporaal 17

07/20/2021 ECA H.Corporaal 18

07/20/2021 ECA H.Corporaal 19

07/20/2021 ECA H.Corporaal 20

07/20/2021 ECA H.Corporaal 22

Branch Logic & Load Store

07/20/2021 ECA H.Corporaal 24

07/20/2021 ECA H.Corporaal 28

SUB.D F8, F10, F14 F6: WaR F6: WaW

MUL.D F6, F10, F8

• False (aka “name”) dependences with F6

07/20/2021 ECA H.Corporaal 29

• Each destination gets a new (physical) register assigned

07/20/2021 ECA H.Corporaal 30

07/20/2021 ECA H.Corporaal 32

• Instruction commit: allowing an instruction to update the register file when

• Need an additional piece of hardware to prevent any irrevocable action until

07/20/2021 ECA H.Corporaal 33

07/20/2021 ECA H.Corporaal 34

• Modify reservation stations:

07/20/2021 ECA H.Corporaal 35

07/20/2021 ECA H.Corporaal 36

• RoB effectively renames the destination registers

07/20/2021 ECA H.Corporaal 37

07/20/2021 ECA H.Corporaal 38

current free list: R2 R6 new free list: R6

• After (free list: R7, R9, R10)

07/20/2021 ECA H.Corporaal 40

07/20/2021 ECA H.Corporaal 42

07/20/2021 ECA H.Corporaal 43

• what's the average branch penalty?

07/20/2021 ECA H.Corporaal 44

07/20/2021 ECA H.Corporaal 45

Predicting next address:

+ Or: try to get rid (some) of those malicious branches

07/20/2021 ECA H.Corporaal 46

• Buffer is like a cache without tags

07/20/2021 ECA H.Corporaal 47

if (aa==2) subi R3,R1,#2