03 Dynamic Sched
S. G. Shivaprasad Yadav
B5000: a very different model: HLL-only, stack-based, segmented VM.
The IBM paper made the case for ISAs suited to microcoded processors, leading to CISC.
The Berkeley paper made the case for ISAs suited to pipelined + cached microprocessors (VLSI), leading to RISC.
Who won, RISC vs. CISC? VAX is dead; Intel 80x86 dominates the desktop, RISC dominates embedded, and servers run both x86 and RISC.
8/21/2011 2
Outline
ILP
Compiler techniques to increase ILP
Loop Unrolling
Static Branch Prediction
Dynamic Branch Prediction
Overcoming Data Hazards with Dynamic Scheduling (Start)
Tomasulo Algorithm
Conclusion
To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks. Simplest: loop-level parallelism, which exploits parallelism among the iterations of a loop. E.g.:

for (i = 1; i <= 1000; i = i + 1)
    x[i] = x[i] + y[i];
Loop-Level Parallelism
Exploit loop-level parallelism by unrolling the loop, either:
1. dynamically, via branch prediction, or
2. statically, via loop unrolling by the compiler
(Another way is vectors, to be covered later.)
Determining instruction dependence is critical to loop-level parallelism. To exploit ILP we must determine which instructions can be executed in parallel. If two instructions are:
parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls (assuming no structural hazards);
dependent, they are not parallel and must be executed in order, although they may often be partially overlapped.
The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline. The importance of the data dependences is that a dependence:
1) indicates the possibility of a hazard
2) determines the order in which results must be calculated
3) sets an upper bound on how much parallelism can possibly be exploited
HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program
Data Dependencies and Hazards
A dependence can be overcome in two different ways:
maintaining the dependence but avoiding a hazard, and eliminating a dependence by transforming the code.
Scheduling the code is the primary method used to avoid a hazard without altering a dependence, and such scheduling can be done both by the compiler and by the hardware.
Data Hazards
A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap during execution would change the order of access to the operand involved in the dependence.

Because of the dependence, we must preserve program order: the order in which the instructions would execute if executed sequentially, one at a time, as determined by the original source program. The goal of both software and hardware techniques is to exploit parallelism by preserving program order only where it affects the outcome of the program. Detecting and avoiding hazards ensures that necessary program order is preserved.

Data hazards may be classified as one of three types, depending on the order of read and write accesses in the instructions. Consider two instructions i and j, with i preceding j in program order. The possible data hazards are:
RAW (read after write)
WAW (write after write)
WAR (write after read)
Data Hazards
RAW (read after write)
j tries to read a source before i writes it, so j incorrectly gets the old value. This hazard is the most common type and corresponds to a true data dependence. Program order must be preserved to ensure that j receives the value from i.
Control Dependencies
A control dependence determines the ordering of an instruction i with respect to a branch instruction, so that instruction i is executed in correct program order and only when it should be. Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order.
One of the simplest examples of a control dependence is the dependence of the statements in the "then" part of an if statement on the branch. For example, in the code segment

if p1 { S1; };
if p2 { S2; }

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Control Dependencies
In general, there are two constraints imposed by control dependences:
1. An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch. For example, we cannot take an instruction from the then portion of an if statement and move it before the if statement.
2. An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch. For example, we cannot take a statement before the if statement and move it into the then portion.
Exception Behavior
Preserving exception behavior: any change in instruction execution order must not change how exceptions are raised in the program (no new exceptions). Example:

DADDU R2,R3,R4
BEQZ  R2,L1
LW    R1,0(R2)
L1:
(Assume branches not delayed)
Data Flow
Data flow: actual flow of data values among instructions that produce results and those that consume them
Branches make the flow dynamic: they determine which instruction is the supplier of data.
Example:

DADDU R1,R2,R3
BEQZ  R4,L
DSUBU R1,R5,R6
L:    OR R7,R1,R8

Does OR depend on DADDU or DSUBU? We must preserve data flow during execution.
Outline
ILP
Compiler techniques to increase ILP
Loop Unrolling
Static Branch Prediction
Dynamic Branch Prediction
Overcoming Data Hazards with Dynamic Scheduling (Start)
Tomasulo Algorithm
Conclusion
Instruction producing result   Instruction using result   Latency in cycles
FP ALU op                      Another FP ALU op          4
FP ALU op                      Store double               3
Load double                    FP ALU op                  1
Load double                    Store double               1
Integer op                     Integer op                 1
Loop: L.D    F0,0(R1)   ;F0=vector element
      ADD.D  F4,F0,F2   ;add scalar from F2
      S.D    0(R1),F4   ;store result
      DADDUI R1,R1,#-8  ;decrement pointer 8B (DW)
      BNEZ   R1,Loop    ;branch R1!=zero
      S.D    0(R1),F4   ;store result
      DADDUI R1,R1,#-8  ;decrement pointer 8B (DW)
                        ;assumes can't forward to branch
      BNEZ   R1,Loop    ;branch R1!=zero

Instruction using result   Latency in clock cycles
Another FP ALU op          3
Store double               2
FP ALU op                  1
Each iteration takes 7 clock cycles, but just 3 are for real work (L.D, ADD.D, S.D); the other 4 are loop overhead and stalls. How can we make it faster?
Loop Unrolling
A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling. Unrolling simply replicates the loop body multiple times, adjusting the loop termination code.

Loop unrolling can also be used to improve scheduling. Because it eliminates the branch, it allows instructions from different iterations to be scheduled together. In this case, we can eliminate the data-use stalls by creating additional independent instructions within the loop body.

If we simply replicated the instructions when we unrolled the loop, the resulting reuse of the same registers could prevent us from effectively scheduling the loop. Thus, we will want to use different registers for each iteration, increasing the required number of registers.
(When unrolled four times, the pointer decrement is altered to 4*8 = 32 bytes.)
For large values of n, most of the execution time will be spent in the unrolled loop
1. Determine that it is legal to move the S.D after the DADDUI and the branch, and find the amount to adjust the S.D offset
2. Determine that unrolling the loop will be useful by finding that the loop iterations are independent
3. Use different registers to avoid unnecessary constraints
4. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
5. Schedule the code, preserving any dependences needed to yield the same result as the original code
Three limits to the gains from loop unrolling:
1. Decrease in the amount of loop overhead amortized with each extra unrolling
2. Growth in code size
3. Register pressure: potential shortfall in registers created by aggressive unrolling and scheduling
If it is not possible to allocate all live values to registers, the code may lose some or all of the advantage of unrolling.
Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction.
A more accurate scheme predicts branches using profile information collected from earlier runs, and modifies the prediction based on the last run:
[Figure: misprediction rate of profile-based static prediction on SPEC benchmarks; integer programs (compress, eqntott, espresso, gcc, li) mispredict more often than floating-point programs (doduc, ear, hydro2d, mdljdp2, su2cor), with averages around 9-10%.]
Problem: in a loop, a 1-bit BHT causes two mispredictions per loop (the average loop runs 9 iterations before exit):
At the end of the loop, when it exits instead of looping as before
The first time through the loop on the next pass, when it predicts exit instead of looping
[State diagram: 2-bit predictor with four states; T and NT arcs move between them, saturating at 11 (strongly taken) and 00 (strongly not taken). Red states: stop, predict not taken; green states: go, predict taken.]
The 2-bit scheme adds hysteresis to the decision-making process.
BHT Accuracy
Mispredict because either:
We made the wrong guess for that branch, or
We got the branch history of the wrong branch when indexing the table (aliasing)
Global branch history: an m-bit shift register keeping the T/NT status of the last m branches. Each entry in the table has 2^m n-bit predictors.
Correlating Branches
A (2,2) predictor: the behavior of the 2 most recent branches selects between four predictions of the next branch, updating just that prediction. The table is indexed by the branch address, with 4 2-bit predictors per branch; the 2-bit global history picks which of the four produces the prediction.
Frequency of Mispredictions
[Figure: frequency of mispredictions on SPEC89 for a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT; rates range from 0% (nasa7) up to 11%, with the (2,2) correlating predictor generally lowest despite its smaller size.]
Tournament Predictors
Multilevel branch predictor
Uses an n-bit saturating counter to choose between predictors
The usual choice is between a global and a local predictor
Tournament Predictors
A tournament predictor using, say, 4K 2-bit counters indexed by the local branch address chooses between:
Global predictor
4K entries indexed by the history of the last 12 branches (2^12 = 4K)
Each entry is a standard 2-bit predictor
Local predictor
Local history table: 1024 10-bit entries recording the last 10 outcomes of each branch, indexed by branch address
The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters
6% misprediction rate per branch on SPECint (19% of INT instructions are branches)
2% misprediction rate per branch on SPECfp (5% of FP instructions are branches)
[Figure: branch mispredictions per 1000 instructions across SPECint2000 and SPECfp2000 benchmarks (e.g., 168.wupwise, 171.swim).]
Branch target calculation is costly and stalls instruction fetch. A Branch Target Buffer (BTB) stores PCs the same way a cache does:
The PC of a branch is sent to the BTB
When a match is found, the corresponding predicted PC is returned
If the branch was predicted taken, instruction fetch continues at the returned predicted PC
Outline
ILP
Compiler techniques to increase ILP
Loop Unrolling
Static Branch Prediction
Dynamic Branch Prediction
Overcoming Data Hazards with Dynamic Scheduling (Start)
Tomasulo Algorithm
Conclusion
Advantages of dynamic scheduling:
It allows code compiled for one pipeline to run efficiently on a different pipeline
It simplifies the compiler
Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling (next lecture)
We will distinguish when an instruction begins execution and when it completes execution; between those two times, the instruction is in execution.
Note: dynamic execution creates WAR and WAW hazards and makes exceptions harder.
Goal: high performance without special compilers. The small number of floating-point registers (4 in the IBM 360) prevented interesting compiler scheduling of operations.
This led Tomasulo to figure out how to get more effective registers: renaming in hardware!
Tomasulo Algorithm
Control & buffers are distributed with the Function Units (FUs)
FU buffers are called reservation stations (RS); they hold pending operands
Registers in instructions are replaced by values or by pointers to reservation stations; this is called register renaming
Renaming avoids WAR and WAW hazards
There are more reservation stations than registers, so the hardware can do optimizations compilers can't
Results go to FUs from the RSs, not through registers, over the Common Data Bus (CDB) that broadcasts results to all FUs
RAW hazards are avoided by executing an instruction only when its operands are available
Loads and stores are treated as FUs with RSs as well
Integer instructions can go past branches (predict taken), allowing FP ops beyond the basic block in the FP queue
Tomasulo Organization
[Figure: Tomasulo organization. The FP op queue issues into reservation stations: Add1-Add3 feeding the FP adders and Mult1-Mult2 feeding the FP multipliers. Load buffers bring data from memory, store buffers hold data for memory, and the FP registers, reservation stations, and buffers are all connected by the Common Data Bus.]
Fields of a reservation station:
Op: the operation to perform on source operands S1 and S2
Vj, Vk: the values of the source operands
Qj, Qk: the reservation stations that will produce the source operands (blank means the value is already available in Vj or Vk)
Busy: indicates the reservation station or FU is busy
Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
Normal data bus: data + destination (a "go to" bus)
Common data bus: data + source (a "come from" bus)
64 bits of data + 4 bits of functional-unit source address
A waiting unit writes the value if the broadcast source matches the functional unit it expects (the producer of the result)
The bus does the broadcast
Tomasulo Example

Instruction stream:
LD    F6, 34+R2
LD    F2, 45+R3
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2
Cycle 0 (before issue): the 3 load buffers and all 5 reservation stations (Add1-Add3, Mult1-Mult2) are empty; no register has a pending writer. Reservation-station timers ("Time") count down the remaining execution cycles of the functional unit.

Cycle 1: LD F6,34+R2 issues to Load1 (Busy, address 34+R2).
Register result status: F6 = Load1.

Cycle 2: LD F2,45+R3 issues to Load2 (Busy, address 45+R3); Load1 is executing.
Register result status: F2 = Load2, F6 = Load1.

Cycle 3: MULTD F0,F2,F4 issues to Mult1 with Vk = R(F4) and Qj = Load2.
Register result status: F0 = Mult1, F2 = Load2, F6 = Load1.
Note: register names are removed (renamed) in the reservation stations; MULTD has issued. Load1 is completing; what is waiting for Load1?

Cycle 4: Load1 writes M(A1) on the CDB. SUBD F8,F6,F2 issues to Add1 with Vj = M(A1) and Qk = Load2; Load2 is still busy.
Register result status: F0 = Mult1, F2 = Load2, F6 = M(A1), F8 = Add1.

Cycle 5: Load2 writes M(A2) on the CDB; Add1 (SUBD, Vj = M(A1), Vk = M(A2)) and Mult1 (MULTD, Vj = M(A2), Vk = R(F4)) begin executing with timers 2 and 10. DIVD F10,F0,F6 issues to Mult2 with Vk = M(A1) and Qj = Mult1.
Register result status: F0 = Mult1, F2 = M(A2), F8 = Add1, F10 = Mult2.

Cycle 6: ADDD F6,F8,F2 issues to Add2 with Vk = M(A2) and Qj = Add1. Timers: Add1 = 1, Mult1 = 9.
Register result status: F6 = Add2.

Cycle 7: Add1 reaches 0 (SUBD finishes executing); Mult1 = 8.

Cycle 8: SUBD writes its result (M-M) on the CDB; Add2 captures it (Vj = M-M, Vk = M(A2)) and begins executing with timer 2. Mult1 = 7.
Register result status: F8 = (M-M).

Cycles 9-10: Add2 counts down 1, then 0; Mult1 counts down 6, then 5.

Cycle 11: ADDD writes (M-M)+M(A2) on the CDB; Mult1 = 4.
Register result status: F6 = (M-M)+M(A2).
Write result of ADDD here? All quick instructions complete in this cycle!

Cycles 12-15: only Mult1 is still executing; its timer counts down 3, 2, 1, 0.

Cycle 16: MULTD writes its result M(A2)*R(F4) on the CDB; Mult2 captures it (Vj = M*F4, Vk = M(A1)) and starts the divide with timer 40.
Register result status: F0 = M*F4.

Cycles 17-56: Mult2 counts down; at cycle 56 its timer reaches 0 and DIVD finishes executing.

Cycle 57: DIVD writes its result on the CDB.
Register result status: F10 = result. The example is complete.
Reservation stations
Permit instruction issue to advance past integer control-flow operations
Also buffer old values of registers, totally avoiding WAR stalls
Another perspective: Tomasulo builds the data-flow dependency graph on the fly
Tomasulo Drawbacks
Complexity: the delays of the 360/91, MIPS 10000, Alpha 21264, and IBM PPC 620 are covered in CA:AQA 2/e, but not in silicon!
Many associative stores (CDB tag matches) at high speed
Performance is limited by the Common Data Bus:
Each CDB must go to multiple functional units, giving high capacitance and high wiring density
The number of functional units that can complete per cycle is limited to one!
Multiple CDBs mean more FU logic for parallel associative stores
Non-precise interrupts!
We will address this later
And In Conclusion #1
Leverage implicit parallelism for performance: instruction-level parallelism
Loop unrolling by the compiler to increase ILP
Branch prediction to increase ILP
Dynamic HW exploiting ILP:
Works when we can't know dependences at compile time
Can hide L1 cache misses
Code for one machine runs well on another
And In Conclusion #2
Reservation stations: renaming to a larger set of registers + buffering of source operands
Prevents registers from becoming the bottleneck
Avoids WAR and WAW hazards
Allows loop unrolling in HW
Not limited to basic blocks (the integer unit gets ahead, beyond branches)
Helps with cache misses as well
Lasting contributions:
Dynamic scheduling
Register renaming
Load/store disambiguation