Chapter 3: Instruction-Level Parallelism
Chapter 3: Instruction-Level Parallelism
Chapter 3: Instruction-Level Parallelism
Parallelism
Instruction-Level Parallelism
Page 1
Improving Performance
Pipeline CPI
Effective CPI = Ideal CPI + Structural Stalls + RAW Stalls +
WAR Stalls + Control Stalls
Code Sequence
if (x < y)
A
else
B
C 4
Page 2
Basic Unit of Instruction-Level
Parallelism
• Sequence at instruction level is a basic block.
• Basic block (BB)
– Straight-line code with no branches into the sequence except
at the top and no branches out except at the bottom.
• Procedure represented as a control flow graph on
nodes that are basic blocks
Code Sequence
if (x < y)
A
else
B
C 5
Basic Blocks
Page 3
Basic Blocks
Page 4
Parallelism in Basic Blocks
Exploiting Parallelism
Page 5
Loop Unrolling
– Operation latency
– Finding independent instructions
11
12
Page 6
Straightforward DLX for Example
Loop: LD F0,0(R1) ; F0=array elem
ADDD F4,F0,F2 ; add scalar S
SD 0(R1),F4 ; store X[I]
SUBI R1,R1,8 ; decrement ptr
; double=8bytes
BNEZ R1,Loop ; branch R1!=0
Latency of Straightforward
Example
Loop: LD F0,0(R1) ; F0=array elem 1
stall 2
ADDD F4,F0,F2 ; add scalar S 3
stall 4
stall 5
SD 0(R1),F4 ; store X[I] 6
SUBI R1,R1,8 ; decrement ptr 7
stall 8
BNEZ R1,Loop ; branch R1!=0 9
stall 10
Page 7
Latency of Scheduled Example
Loop: LD F0,0(R1) ; F0=array elem 1
stall 2
SUBI R1,R1,8 ; decrement ptr 3
ADDD F4,F0,F2 ; add scalar S 4
BNEZ R1,Loop ; branch R1!=0 5
SD 8(R1),F4 ; store X[I] 6
15
Scheduled Example
Page 8
Unrolled Example
Loop: LD F0,0(R1)
ADDD F4,F0,F2
SD 0(R1),F4
LD F6,-8(R1)
ADDD F8,F6,F2
SD -8(R1),F8
LD F10,-16(R1)
ADDD F12,F10,F2
SD -16(R1),F12
LD F14,-24(R1)
ADDD F16,F14,F2
SD -24(R1),F16
SUBI R1,R1,32
BNEZ R1,Loop 17
Unrolled Example
Loop: LD F0,0(R1)
Unroll factor is 4
ADDD F4,F0,F2
assuming that the
SD 0(R1),F4
LD F6,-8(R1)
trip count is a
ADDD F8,F6,F2 multiple of 32.
SD -8(R1),F8
LD F10,-16(R1) Register names
ADDD F12,F10,F2 changed to avoid
SD -16(R1),F12 dependences.
LD F14,-24(R1)
ADDD F16,F14,F2
Offsets adjusted for
SD -24(R1),F16
the change in the
SUBI R1,R1,32
BNEZ R1,Loop
iteration count. 18
Page 9
Unrolled Example
Loop: LD F0,0(R1)
Loop evaluation
ADDD F4,F0,F2
eliminated for 3
SD 0(R1),F4
LD F6,-8(R1)
iterations (removes
ADDD F8,F6,F2 6 instructions)
SD -8(R1),F8
LD F10,-16(R1)
ADDD F12,F10,F2
SD -16(R1),F12
LD F14,-24(R1)
ADDD F16,F14,F2
SD -24(R1),F16
SUBI R1,R1,32
BNEZ R1,Loop 19
Unrolled Example
Loop: LD F0,0(R1)
2 The loop takes 28
ADDD F4,F0,F2
4 cycles per iteration,
SD 0(R1),F4
LD F6,-8(R1)
or 7 cycles (28/4=7)
2
ADDD F8,F6,F2 per original loop body.
4
SD -8(R1),F8
LD F10,-16(R1) Branches removed, so
2
4
ADDD F12,F10,F2 multiple bodies can
SD -16(R1),F12 be scheduled together
LD F14,-24(R1)
2
4 ADDD F16,F14,F2
SD -24(R1),F16
2 SUBI R1,R1,32
2 BNEZ R1,Loop 20
Page 10
Scheduled Unrolled Example
Loop: LD F0,0(R1)
Independent iterations
LD F6,-8(R1)
LD F10,-16(R1) used to mask latency.
LD F14,-24(R1)
ADDD F4,F0,F2 The loop takes 14
ADDD F8,F6,F2 cycles per iteration,
ADDD F12,F10,F2 or 3.5 cycles per
ADDD F16,F14,F2 original loop body
SD 0(R1),F4
(14/4=3.5).
SD -8(R1),F8
SUBI R1,R1,32
Offset adjustments SD 16(R1),F12
SUBI moved above the
-32+16 = -16 BNEZ R1,Loop last two SDs, so the
-32+8 = -24 SD 8(R1),F16 offset is adjusted 21
Loop Unrolling
22
Page 11
Summary of Unrolling Example
23
Dependences
• Dependence types
– Data (dependence on values)
– Name (conflicts on names)
– Control (dynamic data flow dependent on branches)
24
Page 12
Data Dependences
Loop: LD F0,0(R1)
ADDD F4,F0,F2
SD 0(R1),F4
...
SUBI R1,R1,8
…
BNEZ R1,Loop
25
Data Dependences
Loop: LD F0,0(R1)
ADDD F4,F0,F2
SD 0(R1),F4
...
SUBI R1,R1,8
…
BNEZ R1,Loop
26
Page 13
Data Dependent Instructions
27
28
Page 14
Overcoming Data Dependences
Loop: ...
SUBI R1,R1,8
LD F6,0(R1)
ADDD F8,F6,F2
SD 0(R1),F8
SUBI R1,R1,8
...
29
30
Page 15
Overcoming Data Dependences
31
32
Page 16
Name Dependences
33
34
Page 17
Renaming in the Unroll Example
Loop: LD F0,0(R1)
Where are the name
ADDD F4,F0,F2
dependences in this
SD 0(R1),F4
LD F0,-8(R1)
loop?
ADDD F4,F0,F2
SD -8(R1),F4
LD F0,-16(R1)
ADDD F4,F0,F2
SD -16(R1),F4
LD F0,-24(R1)
ADDD F4,F0,F2
SD -24(R1),F4
SUBI R1,R1,32
BNEZ R1,Loop 35
Page 18
Renaming in the Unroll Example
Loop: LD F0,0(R1)
With registers renamed,
ADDD F4,F0,F2
the true dependences
SD 0(R1),F4
LD F6,-8(R1)
now constrain the
ADDD F8,F6,F2 scheduling of the loop.
SD -8(R1),F8
LD F10,-16(R1) Each loop body can
ADDD F12,F10,F2 be overlapped with the
SD -16(R1),F12 others.
LD F14,-24(R1)
ADDD F16,F14,F2
SD -24(R1),F16
SUBI R1,R1,32
BNEZ R1,Loop 37
Control Dependences
C1 C2 C
if (cond)
A; A,B,C are completely
if (cond2) A B independent - C could
B; depend on A or B and
C; would need an edge
return; return (A,C) or (B,C)
38
Page 19
Control Constraints
Control Constraints
Page 20
Control Constraints
42
Page 21
Overcoming Control Constraints
• Exceptions
BEQZ R2,L1
LW R1,0(R2)
L1: …
43
• Exceptions
BEQZ R2,L1 Hoist LW above the branch
LW R1,0(R2)
L1: …
44
Page 22
Overcoming Control Constraints
• Exceptions
BEQZ R2,L1 Hoist LW above the branch -
LW R1,0(R2) a new exception may result
L1: … since BEQZ guards R2
45
• Exceptions
BEQZ R2,L1
LW R1,0(R2)
L1: …
• Data flow
ADD R1,R2,R3
BEQZ R4,L0
SUB R1,R5,R6
L0: OR R7,R1,R8
46
Page 23
Overcoming Control Constraints
• Exceptions
BEQZ R2,L1
LW R1,0(R2)
L1: …
47
• Exceptions
BEQZ R2,L1
LW R1,0(R2)
L1: …
48
Page 24
Overcoming Control Constraints
• Exceptions
BEQZ R2,L1
LW R1,0(R2)
L1: …
• Sometimes it’s OK
ADD R1,R2,R3
Suppose R4 is dead
BEQZ R12,L0
at L0
SUB R4,R5,R6
ADD R5,R4,R9
L0: OR R7,R8,R9 R4 is dead here
50
Page 25
Overcoming Control Constraints
• Sometimes it’s OK
ADD R1,R2,R3
Suppose R4 is dead
BEQZ R12,L0
after L0 - then we
SUB R4,R5,R6
can move the SUB
ADD R5,R4,R9 above the branch
L0: OR R7,R8,R9
51
• Sometimes it’s OK
ADD R1,R2,R3
Suppose R4 is dead
BEQZ R12,L0
after L0 - then we
SUB R4,R5,R6
can move the SUB
ADD R5,R4,R9 above the branch
L0: OR R7,R8,R9
Page 26
Loop Carried Dependences
54
Page 27
Loop Carried Dependences
for (I=1; I<=100; I=I+1) {
A[I] = A[I] + B[I];
B[I+1] = C[I] + D[I];
}
55
Parallelized Loop
A[I] = A[1] + B[1];
for (I = 1; I <= 99; I+=1) {
B[I+1] = C[I] + D[I];
A[I+1] = A[I+1] + B[I+1];
}
B[101] = C[100] + D[100];
Page 28