ILP-Architectures Part I

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 56

Embedded Computer Architecture

5SAI0

Instruction Level Parallel


Architectures
Part I
Henk Corporaal
www.ics.ele.tue.nl/~heco/courses/ECA
TUEindhoven
2020-2021
Topics on ILP architectures
• Introduction, Hazards (short recap)
• Out-Of-Order (OoO) execution (H&P sect 3.4-3.6):
– Dependences limit ILP: dynamic scheduling
– Hardware speculation
• Branch prediction (H&P sect 3.6 + 3.9)
• Multiple issue (H&P sect 3.7-3.8)
• How much ILP is there?

• Material Ch 3 (H&P or Dubois, second part)

07/20/2021 ECA H.Corporaal 2


What is ILP = Instruction level parallelism
• multiple operations (or instructions) can be executed in parallel, from a
single instruction stream
– so we are not yet talking about MIMD, multiple instruction streams

Needed:
• Sufficient (HW) resources
• Parallel scheduling
– Hardware solution
– Software solution
• Application should contain sufficient ILP

07/20/2021 ECA H.Corporaal 3


Single Issue RISC vs Superscalar
instr op instr op
instr op instr op
instr op instr op
instr op instr op
instr op instr op
instr op Change HW, instr op
instr op but can use instr op
instr op same code instr op
instr op instr op
instr op instr op
instr op instr op
instr op instr op
execute issue and (try to) execute
1 instr/cycle 3-issue Superscalar 3 instr/cycle

(1-issue)
RISC CPU
07/20/2021 ECA H.Corporaal 4
Recap (from pipelining lectures): Hazards
• Three types of hazards (see previous lecture)
– Structural
• Multiple instructions need access to the same hardware at the same time
– Data dependence
• There is a dependence between operands (in register or memory) of successive
instructions
– Control dependence
• Determines the order of the execution of basic blocks
• When jumping/branching to another address the pipeline has to be (partly)
squashed and refilled
• Hazards cause scheduling problems and delay the pipeline

07/20/2021 ECA H.Corporaal 5


Impact of Hazards
• Hazards cause pipeline 'bubbles'
Increase of CPI (and therefore execution time)

• Texec = Ninstr * CPI * Tcycle


where
 CPI = CPIbase + Σi <CPIhazard_i>
 <CPIhazard> = fhazard * <Cycle_penaltyhazard>
 fhazard = fraction [0..1] of occurrence of this hazard
07/20/2021 ECA H.Corporaal 6
Data dependences
(see earlier lectures for details & examples)

• RaW read after write


– real or flow dependence
– can only be avoided by value prediction (i.e. speculating on the outcome of a
previous operation)
• WaR write after read
• WaW write after write
– WaR and WaW are false or name dependencies
– Could be avoided by renaming (if sufficient registers are available); see later slide
Notes:
1. data dependences can be both between registers and memory data operations
2. data dependencies are shown in de DDG: Data Dependence Graph

07/20/2021 ECA H.Corporaal 7


Control Dependences: CFG
C input code: 1
sub t1, a, b
CFG
if (a > b) { r = a % b; } bgz t1, 2, 3 (Control Flow Graph):
else { r = b % a; }
y = a*b;
2 3
rem r, a, b rem r, b, a
goto 4 goto 4

4
mul y,a,b
…………..
Questions:
• How real are control dependences?
• Can ‘mul y,a,b’ be moved to block 2, 3 or even block 1?
• Can ‘rem r, a, b’ be moved to block 1 and executed speculatively?
07/20/2021 ECA H.Corporaal 8
Avoiding pipeline stalls due to Hazards
• Structural
– Buy more hardware
• Extra units, pipelined units, more ports on RF and data memory (or banked memories), etc.
– Note: more HW means bigger chip => could increase cycle time tcycle
• Data dependence
– Real (RaW) dependences: add Forwarding (aka Bypassing) logic
• Compiler optimizations
– False (WaR & WaW) dependences: use renaming (either in HW or in SW)
• Control dependence
– Branch prediction
– Avoiding Branches
– Adding pipeline HW to reduce the number of Branch delay slots
07/20/2021 ECA H.Corporaal 9
Topics on ILP architectures
• Introduction, Hazards (short recap)
• Out-Of-Order (OoO) execution:
– Dependences limit ILP: dynamic scheduling
– Hardware speculation
• Branch prediction
• Multiple issue
• How much ILP is there?

• Material Ch 3 (H&P or Dubois, second part)

07/20/2021 ECA H.Corporaal 10


Let's look at Out-of-Order execution:

07/20/2021 ECA H.Corporaal 11


Dynamic Scheduling Principle
• What we examined so far is static scheduling
– Compiler reorders instructions so as to avoid hazards and reduce stalls
• Dynamic scheduling:
hardware rearranges instruction execution to reduce stalls
• Example:
DIV.D F0,F2,F4 ; takes 24 cycles and
; is not pipelined
ADD.D
RaW; real dependence
F10,F0,F8

This instruction cannot continue


SUB.D F12,F8,F14 even though it does not depend
on previous Div and Add

• Key idea: Allow instructions behind stall to proceed


• Book describes Tomasulo algorithm in detail, but we first describe general idea

07/20/2021 ECA H.Corporaal 12


Advantages of Dynamic Scheduling
(compared to static scheduling, by compiler)
• Handles cases when dependences are unknown at compile time
– e.g., because they may involve a memory reference
• Simplifies the compiler (although a good compiler helps)

• Binary compatible: Allows code compiled for one machine to run


efficiently on a different machine, with different number of function units
(FUs), and different pipelining

• Allows hardware speculation, a technique with significant performance


advantages, that builds on dynamic scheduling
– Executing instructions before control dependend branch is handled
– I.e., we speculate on the branch outcome

07/20/2021 ECA H.Corporaal 13


Superscalar: General Architecture Concept
Instruction Instruction
Instruction
Memory Cache

Decoder

Reservation
Stations

Branch Logic & Load Store


ALU-1 ALU-2
Unit Shift Unit Unit

Address

Data
Data Cache Data

Reorder
Register
Buffer
File Data
Memory
07/20/2021 ECA H.Corporaal 14
Example of Superscalar Execution
• Superscalar processor organization, example:
– simple pipeline: IF, EX, WB
– fetches/issues upto 2 instructions each cycle (= 2-issue)
– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier
– Instruction window (buffer between IF and EX stage) is of size 2
– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

Cycle 1 2 3 4 5 6 7
L.D F6,32(R2)
L.D F2,48(R3)
MUL.D F0,F2,F4
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2
MUL.D F12,F2,F4

07/20/2021 ECA H.Corporaal 15


Example of Superscalar Processor Execution
• Superscalar processor organization:
– simple pipeline: IF, EX, WB
– fetches 2 instructions each cycle
– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier
– Instruction window (buffer between IF and EX stage) is of size 2
– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

Cycle 1 2 3 4 5 6 7
L.D F6,32(R2) IF
L.D F2,48(R3) IF
MUL.D F0,F2,F4
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2
MUL.D F12,F2,F4

07/20/2021 ECA H.Corporaal 16


Example of Superscalar Processor Execution
• Superscalar processor organization:
– simple pipeline: IF, EX, WB
– fetches 2 instructions each cycle
– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier
– Instruction window (buffer between IF and EX stage) is of size 2
– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

Cycle 1 2 3 4 5 6 7
L.D F6,32(R2) IF EX
L.D F2,48(R3) IF EX
MUL.D F0,F2,F4 IF
SUB.D F8,F2,F6 IF
DIV.D F10,F0,F6
ADD.D F6,F8,F2
MUL.D F12,F2,F4

07/20/2021 ECA H.Corporaal 17


Example of Superscalar Processor Execution
• Superscalar processor organization:
– simple pipeline: IF, EX, WB
– fetches 2 instructions each cycle
– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier
– Instruction window (buffer between IF and EX stage) is of size 2
– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

Cycle 1 2 3 4 5 6 7
L.D F6,32(R2) IF EX WB
L.D F2,48(R3) IF EX WB
MUL.D F0,F2,F4 IF EX
SUB.D F8,F2,F6 IF EX
DIV.D F10,F0,F6 IF
ADD.D F6,F8,F2 IF
MUL.D F12,F2,F4

07/20/2021 ECA H.Corporaal 18


Example of Superscalar Processor Execution
• Superscalar processor organization:
– simple pipeline: IF, EX, WB
– fetches 2 instructions each cycle
– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier
– Instruction window (buffer between IF and EX stage) is of size 2
– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

Cycle 1 2 3 4 5 6 7
L.D F6,32(R2) IF EX WB
L.D F2,48(R3) IF EX WB
MUL.D F0,F2,F4 IF EX EX
SUB.D F8,F2,F6 IF EX EX
DIV.D F10,F0,F6 IF
stall because
ADD.D F6,F8,F2 IF of data dep.
MUL.D F12,F2,F4
cannot be fetched because window full

07/20/2021 ECA H.Corporaal 19


Example of Superscalar Processor Execution
• Superscalar processor organization:
– simple pipeline: IF, EX, WB
– fetches 2 instructions each cycle
– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier
– Instruction window (buffer between IF and EX stage) is of size 2
– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

Cycle 1 2 3 4 5 6 7
L.D F6,32(R2) IF EX WB
L.D F2,48(R3) IF EX WB
MUL.D F0,F2,F4 IF EX EX EX
SUB.D F8,F2,F6 IF EX EX WB
DIV.D F10,F0,F6 IF
ADD.D F6,F8,F2 IF EX
MUL.D F12,F2,F4 IF

07/20/2021 ECA H.Corporaal 20


Example of Superscalar Processor Execution
• Superscalar processor organization:
– simple pipeline: IF, EX, WB
– fetches 2 instructions each cycle
– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier
– Instruction window (buffer between IF and EX stage) is of size 2
– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

Cycle 1 2 3 4 5 6 7
L.D F6,32(R2) IF EX WB
L.D F2,48(R3) IF EX WB
MUL.D F0,F2,F4 IF EX EX EX EX
SUB.D F8,F2,F6 IF EX EX WB
DIV.D F10,F0,F6 IF
ADD.D F6,F8,F2 IF EX EX
MUL.D F12,F2,F4 IF
cannot execute
structural hazard
07/20/2021 ECA H.Corporaal 21
Example of Superscalar Processor Execution
• Superscalar processor organization:
– simple pipeline: IF, EX, WB
– fetches 2 instructions each cycle
– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier
– Instruction window (buffer between IF and EX stage) is of size 2
– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

Cycle 1 2 3 4 5 6 7
L.D F6,32(R2) IF EX WB
L.D F2,48(R3) IF EX WB
MUL.D F0,F2,F4 IF EX EX EX EX WB
SUB.D F8,F2,F6 IF EX EX WB
DIV.D F10,F0,F6 IF EX
ADD.D F6,F8,F2 IF EX EX WB
MUL.D F12,F2,F4 IF ?

07/20/2021 ECA H.Corporaal 22


Superscalar: General Architectur Concept
Instruction Instruction
Instruction
Memory Cache

Decoder

Reservation
Stations

Branch Logic & Load Store


ALU-1 ALU-2
Unit Shift Unit Unit

Address

Data
Data Cache Data

Reorder
Register
Buffer
File Data
Memory
07/20/2021 ECA H.Corporaal 23
Superscalar Issues
1. How to fetch multiple instructions in time?
(especially across basic block boundaries)
2. Predicting branches
3. Non-blocking memory system
4. Tune #resources(FUs, ports, entries, etc.)
5. Handling dependencies
6. How to support precise interrupts?
7. How to recover from a mis-predicted branch path?
• For the latter two issues you may have a look at speculative and architectural state ,
– Processor Microarchitecture by Antonio González e.a., Morgan&Claypool 2011
– Superscalar Microprocessor Design, Mike Johnson, 1991 (or thesis 1989)

07/20/2021 ECA H.Corporaal 24


Beginning of dynamic scheduling
• 1967 @ IBM
– Robert Tomasulo
– Dynamic scheduling of Floating Point operations on multiple execution
units
– Allowing Out-Of-Order execution
– Register renaming in HW
• Only a few architectural visible Floating Point registers were available
• Renaming extends this number (although not architectural visible)
– Reservation stations in front of Execution Units
– CDB: Common Data Bus for all result broadcasts
• see https://en.wikipedia.org/wiki/Tomasulo_algorithm
07/20/2021 ECA H.Corporaal 25
Tomasulo’s Algorithm
• Top-level design:

• Note: Load and Store buffers contain data and addresses, act like reservation stations
07/20/2021 ECA H.Corporaal 26
Tomasulo’s Algorithm
3 basic steps:
• Issue
– Get next instruction from FIFO queue
– If available RS (reservation station), issue instruction to the RS with operand values if
available
– If operand values not available, stall the instruction
• Execute
– When operand becomes available, store it in all reservation stations waiting for it
– When all operands are ready, issue the instruction
– Loads and stores are maintained in program order
– No instruction allowed to initiate execution until all branches that proceed it in program
order have completed
• Write result
– Write result on CDB into reservation stations and store buffers
• (Stores must wait until address and value are received)
07/20/2021 ECA H.Corporaal 27
Register Renaming Technique
• Eliminate name/false dependencies:
– anti- (WaR) and output (WaW)) dependencies

• Can be implemented
– by the compiler
• advantage: low cost
• disadvantage: “old” codes perform poorly
– in hardware
• advantage: binary compatibility
• disadvantage: extra hardware needed

07/20/2021 ECA H.Corporaal 28


Register Renaming: general idea
• Example, look at F6: Question: how can this code
(optimally) be executed?
DIV.D F0, F2, F4
ADD.D F6, F0, F8
S.D F6, 0(R1) F6: RaW

SUB.D F8, F10, F14 F6: WaR F6: WaW

MUL.D F6, F10, F8

• False (aka “name”) dependences with F6


– anti: WaR and
– output: WaW in this example

07/20/2021 ECA H.Corporaal 29


Original code:
Register Renaming DIV.D F0, F2, F4
ADD.D F6, F0, F8
• Same example after renaming: S.D F6, 0(R1)
SUB.D F8, F10, F14
DIV.D R, F2, F4 MUL.D F6, F10, F8
ADD.D S, R, F8
S.D S, 0(R1)
SUB.D T, F10, F14
MUL.D U, F10, T

• Each destination gets a new (physical) register assigned


• Now only RaW hazards remain, which can be strictly ordered
• We will see several implementations of Register Renaming
1. use ReOrder Buffer (ROB) & Reservation Stations, or
2. use large register file with mapping table

07/20/2021 ECA H.Corporaal 30


Register Renaming with Tomasulo
• Register renaming can be provided by reservation stations (RS)
– Each RS entry contains 3 fields:
• Instruction
• Buffered operand values (when available)
• Reservation station number of instruction providing the operand values
• Operation of RS:
– RS fetches and buffers an operand as soon as it becomes available
• not necessarily involving register file
– Pending instructions designate the RS to which they will send their output
• Result values broadcast on a result bus, called the common data bus (CDB)
– Only the last output updates the register file
– As instructions are issued, the register-IDs are renamed with the reservation station
– We can have many more reservation stations than (architectural) registers
07/20/2021 ECA H.Corporaal 31
Example: snapshot, first L.D just finished

07/20/2021 ECA H.Corporaal 32


Speculation (Hardware based)
• Execute instructions along predicted execution paths but only commit the
results if the prediction was correct

• Instruction commit: allowing an instruction to update the register file when


instruction is no longer speculative

• Need an additional piece of hardware to prevent any irrevocable action until


an instruction commits:
– Reorder buffer, or Large renaming register file
– why? think about it?

07/20/2021 ECA H.Corporaal 33


OoO execution with speculation using RoB

NEW

07/20/2021 ECA H.Corporaal 34


Reorder Buffer (RoB)
• RoB – holds the result of instruction between completion and
commit
– Four fields per entry:
1. Instruction type: branch/store/register
2. Destination field: register number
3. Value field: output value
4. Ready field: completed execution? (is the data valid)

• Modify reservation stations:


– Operand source-id is now RoB (if its producer is still in the RoB) instead
of functional unit (check yourself!)

07/20/2021 ECA H.Corporaal 35


Reorder buffer

in operation

07/20/2021 ECA H.Corporaal 36


Reorder Buffer (RoB)
• Register values and memory values are not written until an
instruction commits

• RoB effectively renames the destination registers


– every destination gets a new entry in the RoB
• On misprediction:
– Speculated entries in RoB are cleared

• Exceptions:
– Not recognized/taken until it is ready to commit

07/20/2021 ECA H.Corporaal 37


Register renaming
• The RoB together with the Reservation station effectively
implement register renaming => avoiding WaW and WaR Hazards
when reordering code

• But...
– there is an alternative.....

07/20/2021 ECA H.Corporaal 38


Register Renaming using RAT
• there’s a physical register file larger than logical register file
• mapping table associates logical registers with physical register
– called RAT = Register Alias Table
• when an instruction is decoded:
– its physical source registers are obtained from mapping table
– its physical destination register is obtained from a free list
– mapping table is updated before:addi r3,r3,4 after: addi R2,R1,4
current new
r0 R8 r0 R8
mapping table: mapping table:
r1 R7 r1 R7

r2 R5 r2 R5
r3 R1 r3 R2
r4 R9 r4 R9

current free list: R2 R6 new free list: R6


07/20/2021 ECA H.Corporaal 39
Register Renaming using mapping table
Another example:
• Before (assume r0->R8, r1->R6, r2->R5, .. ):
• addi r1, r2, 1
• addi r2, r0, 0 // WaR
• addi r1, r2, 1 // WaW + RaW

• After (free list: R7, R9, R10)


• addi R7, R5, 1
• addi R10, R8, 0 // WaR disappeared
• addi R9, R10, 1 // WaW disappeared,
// RaW renamed to R10

07/20/2021 ECA H.Corporaal 40


Summary O-o-O architectures
• Instructions issue in-order, execute out-of-order,
and commit in-order !!
• Renaming avoids naming dependences
– via RoB: allocating an entry for every result, or
– via Register Mapping: Architectural registers are mapped on (many
more) Physical registers
• Speculation beyond branches
– branch prediction required (see next slides)
– recovery needed in case of miss prediction
• in-order-completion
• also needed for precise exception support
07/20/2021 ECA H.Corporaal 41
Topics on ILP architectures
• Introduction, Hazards (short recap)
• Out-Of-Order (OoO) execution:
• Branch prediction
– 1&2 bit prediction
– Branch correlation
– Avoiding branches: if-conversion
• Multiple issue
• How much ILP is there?
• Material Ch 3 (H&P or Dubois, second part)

07/20/2021 ECA H.Corporaal 42


Branch Prediction

• what is it?
• why do we need it?
• how does it work?
• various schemes

07/20/2021 ECA H.Corporaal 43


Branch Prediction Principles
breq r1, r2, label // if r1==r2
// then PCnext = label
// else PCnext = PC + 4 (for a RISC)

Questions:
• do I jump ? -> branch prediction
• where do I jump ? -> branch target prediction

• what's the average branch penalty?


– <CPIbranch_penalty>
– i.e. how many instruction slots do I miss (or squash) due to branches

07/20/2021 ECA H.Corporaal 44


Branch Prediction & Speculation
• High branch penalties in pipelined processors:
– With about 20% of the instructions being a branch, the maximum ILP is
five (but actually much less!)
• CPI = CPIbase + fbranch * fmisspredict * cycles_penalty
– Large impact if:
• Penalty high: long pipeline
• CPIbase low: for multiple-issue processors,
• Idea: predict the outcome of branches based on their history and
execute instructions at the predicted branch target speculatively

07/20/2021 ECA H.Corporaal 45


Branch Prediction Schemes
Predict branch direction:
• 1-bit Branch Prediction Buffer
• 2-bit Branch Prediction Buffer
• Correlating Branch Prediction Buffer

Predicting next address:


• Branch Target Buffer
• Return Address Predictors

+ Or: try to get rid (some) of those malicious branches

07/20/2021 ECA H.Corporaal 46


1-bit Branch Prediction Buffer
• 1-bit branch prediction buffer or branch history table (BHT):
PC 10…..10 101 00

BHT
0
1
k-bits 0
1 size=2k
0
1
1
0

• Buffer is like a cache without tags


• Does not help for our MIPS pipeline because target address calculations performed in
same stage as branch condition calculation

07/20/2021 ECA H.Corporaal 47


PC 10…..10 101 00
Two 1-bit predictor problems
BHT
0
1
k-bits 0
1 size=2k
0
1
1
• Aliasing: lower k bits of different branch instructions could be the same 0

– Solution: Use tags (the buffer becomes a tag); however very expensive
• Loops are predicted wrong twice
– Solution: Use n-bit saturation counter prediction
* taken if counter  2 (n-1)
* not-taken if counter < 2 (n-1)
– A 2-bit saturating counter predicts a loop wrong only once
07/20/2021 ECA H.Corporaal 48
2-bit Branch Prediction Buffer
• Solution: 2-bit scheme where prediction is changed only if mispredicted
twice
• Can be implemented as a saturating counter, e.g. as following state diagram:

T
NT
Predict Taken Predict Taken
T
T NT
NT
Predict Not Predict Not
Taken T Taken

NT
07/20/2021 ECA H.Corporaal 49
Next step: Correlating Branches
• Fragment from SPEC92 benchmark eqntott:

if (aa==2) subi R3,R1,#2


aa = 0; b1: bnez R3,L1
if (bb==2) add R1,R0,R0
bb=0; L1: subi R3,R2,#2
if (aa!=bb){..} b2: bnez R3,L2
add R2,R0,R0
L2: sub R3,R1,R2
b3: beqz R3,L3

07/20/2021 ECA H.Corporaal 50


Correlating Branch Predictor 4 bits from branch address
Idea: behavior of current branch is related to
(taken/not taken) history of recently 2-bits per branch
executed branches local predictors
– Then behavior of recent branches selects
between, say, 4 predictions of next branch,
updating just that prediction

Prediction
Prediction
• (2,2) predictor: 2-bit global, 2-bit local

• (k,n) predictor uses behavior of last k


branches to choose from 2k predictors, each
of which is n-bit predictor
shift register,
remembers
last 2 branches
2-bit global
branch history register
(01 = not taken, then taken)
07/20/2021 ECA H.Corporaal 51
Branch Correlation: the General Scheme
• 4 parameters: (a, k, m, n)
Pattern History Table
2m-1

n-bit
m saturating
1 Up/Down
Counter Prediction
Branch Address 0

0 1 2k-1

a k

Branch History Table


(BHT)

Table size (usually n = 2): Nbits = k * 2a + 2k * 2m *n


• mostly n = 2
07/20/2021 ECA H.Corporaal 52
Two varieties
1. GA: Global history, a = 0
• only one (global) history register  correlation is with previously executed
branches (often different branches)
• Variant: Gshare (Scott McFarling’93): GA which takes logic OR of PC address bits
and branch history bits

2. PA: Per address history, a > 0


• if a large almost each branch has a separate history
• so we correlate with same branch

07/20/2021 ECA H.Corporaal 53


Accuracy, taking the best combination of parameters
(a, k, m, n) :
GA (0,11,5,2)

98
PA (10, 6, 4, 2)
Branch Prediction Accuracy (%)

97
96

95 Bimodal
94
GAs
93
PAs
92

91

89

64 128 256 1K 2K 4K 8K 16K 32K 64K


Predictor Size (bytes)
07/20/2021 ECA H.Corporaal 54
Branch Prediction; summary
• Basic 2-bit predictor:
– For each branch:
• Predict taken or not taken
• If the prediction is wrong two consecutive times, change prediction
• Correlating (global history) predictor:
– Multiple 2-bit predictors for each branch
– One for each possible combination of outcomes of preceding n branches
• Local predictor:
– Multiple 2-bit predictors for each branch
– One for each possible combination of outcomes for the last n occurrences of this
branch
• Tournament predictor:
– Combine correlating global predictor with local predictor

07/20/2021 ECA H.Corporaal 55


Next lectures
• Wednesday: GPUs
• Monday: ILP part II

07/20/2021 ECA H.Corporaal 56

You might also like