Chapter 3: Instruction-Level Parallelism



Instruction-Level Parallelism

• Straightforward pipelining overlaps the execution of multiple independent instructions (processing overlapped across pipeline stages).

• Instruction-level parallelism: exploiting parallelism among instructions to improve performance.

• Generally invisible to the programmer (but not to the compiler!)

Improving Performance

Pipeline CPI
Effective CPI = Ideal CPI + Structural stalls + RAW stalls + WAR stalls + Control stalls

To improve performance, we can tackle any one of these terms:

Control stalls: loop unrolling, dynamic branch prediction, speculation
RAW stalls: scheduling, scoreboarding, memory disambiguation
WAR stalls: scheduling with register renaming
Data stalls: dependence analysis, software pipelining, speculation
Ideal CPI: multiple issue, dependence analysis, software pipelining

Basic Unit of Instruction-Level Parallelism

• Sequence at instruction level is a basic block.
• Basic block (BB)
– Straight-line code with no branches into the sequence except at the top and no branches out except at the bottom.
• Procedure represented as a control flow graph whose nodes are basic blocks

Code Sequence
if (x < y)
    A
else
    B
C


Basic Blocks

L0: ADD r1,r2,r3
    SUB r4,r1,r5
    BEQZ r4,L0
    ADD r3,r2,r1
    BNEZ r3,L1
    OR r8,r10,r11
L1: AND r5,r6,r7
    OR r3,r2,r1
    BEQZ r5,L2

Basic Blocks

L0: ADD r1,r2,r3    ; BB #1
    SUB r4,r1,r5
    BEQZ r4,L0
    ADD r3,r2,r1    ; BB #2
    BNEZ r3,L1
    OR r8,r10,r11   ; BB #3
L1: AND r5,r6,r7    ; BB #4
    OR r3,r2,r1
    BEQZ r5,L2

Control Flow Graph

L0: ADD r1,r2,r3    ; BB #1
    SUB r4,r1,r5
    BEQZ r4,L0
    ADD r3,r2,r1    ; BB #2
    BNEZ r3,L1
    OR r8,r10,r11   ; BB #3
L1: AND r5,r6,r7    ; BB #4
    OR r3,r2,r1
    BEQZ r5,L2

Parallelism in Basic Blocks

• Amount of parallelism within a BB is relatively small.
– Average BB size is 4-6 instructions
– Parallelism is smaller still because the instructions are typically interdependent
• Consider branches as 20% of the instruction mix. How big is the typical basic block?
20% → 1 in 5 instructions is a branch → about 5 instructions

Exploiting Parallelism

• A BB is too small and its instructions are interdependent.
• Exploit parallelism across basic blocks
• ILP is frequently found in loops, across iterations of the loop. E.g.:

for (I=1; I<=1000; I+=1)
    x[I] = x[I] + y[I];

• Every iteration is independent and can be fully overlapped with every other one

Loop Unrolling

• Keep the pipeline full with independent instructions that can execute simultaneously
– Operation latency
– Finding independent instructions
• Expose parallelism across loop iterations
– Compiler pipeline scheduling
– Hardware dynamic scheduling

Loop Unrolling Example

• DLX pipeline with latencies

FP ALU op → FP ALU op        3 cycles
FP ALU op → Store double     2 cycles
Load double → FP ALU op      1 cycle
Load double → Store double   0 cycles

• Simple loop (array x and scalar s are doubles)

for (I = 1; I <= 1000; I++)
    x[I] = x[I] + s;

Straightforward DLX for Example
Loop: LD F0,0(R1) ; F0=array elem
ADDD F4,F0,F2 ; add scalar S
SD 0(R1),F4 ; store X[I]
SUBI R1,R1,8 ; decrement ptr
; double=8bytes
BNEZ R1,Loop ; branch R1!=0

• R1 holds the address of the array (highest element), F2 contains the scalar s, and &x[1] = 8.

• How fast does this code run on DLX without scheduling?

Latency of Straightforward
Example
Loop: LD F0,0(R1) ; F0=array elem 1
stall 2
ADDD F4,F0,F2 ; add scalar S 3
stall 4
stall 5
SD 0(R1),F4 ; store X[I] 6
SUBI R1,R1,8 ; decrement ptr 7
stall 8
BNEZ R1,Loop ; branch R1!=0 9
stall 10

• Total of 5 stall cycles (10 cycles per iteration)

Latency of Scheduled Example
Loop: LD F0,0(R1) ; F0=array elem 1
stall 2
SUBI R1,R1,8 ; decrement ptr 3
ADDD F4,F0,F2 ; add scalar S 4
BNEZ R1,Loop ; branch R1!=0 5
SD 8(R1),F4 ; store X[I] 6

• Interchanged SUBI and SD: the compiler has to notice that the SD offset changes (0 becomes 8) when SD is moved past the SUBI
• SUBI and BNEZ fill the delay of the ADDD, and SD sits in the branch delay slot

Scheduled Example

• Computation: 3 cycles (LD, ADDD, SD)
• Overhead: 3 cycles (SUBI, BNEZ, and one stall)
• 50% of the cycles in each iteration are spent on overhead

• Unroll the loop to lower branch overhead: replicate the loop body several times and remove the extraneous branches

• May also expose more scheduling opportunities, i.e. chances to overlap independent instructions.

Unrolled Example
Loop: LD F0,0(R1)        ; 2 cycles (LD + 1 load stall)
      ADDD F4,F0,F2      ; 4 cycles (ADDD + 2 stalls + SD)
      SD 0(R1),F4
      LD F6,-8(R1)       ; 2
      ADDD F8,F6,F2      ; 4
      SD -8(R1),F8
      LD F10,-16(R1)     ; 2
      ADDD F12,F10,F2    ; 4
      SD -16(R1),F12
      LD F14,-24(R1)     ; 2
      ADDD F16,F14,F2    ; 4
      SD -24(R1),F16
      SUBI R1,R1,32      ; 2 cycles (SUBI + 1 stall)
      BNEZ R1,Loop       ; 2 cycles (BNEZ + delay slot)

Unroll factor is 4, assuming the trip count is a multiple of 4 (R1 drops by 32 bytes per pass). Register names changed to avoid dependences; offsets adjusted for the change in the iteration count. Loop evaluation eliminated for 3 of every 4 iterations (removes 6 instructions).

The loop takes 28 cycles per iteration, or 7 cycles (28/4 = 7) per original loop body. Branches removed, so multiple bodies can be scheduled together.

Scheduled Unrolled Example
Loop: LD F0,0(R1)
      LD F6,-8(R1)
      LD F10,-16(R1)
      LD F14,-24(R1)
      ADDD F4,F0,F2
      ADDD F8,F6,F2
      ADDD F12,F10,F2
      ADDD F16,F14,F2
      SD 0(R1),F4
      SD -8(R1),F8
      SUBI R1,R1,32
      SD 16(R1),F12
      BNEZ R1,Loop
      SD 8(R1),F16       ; branch delay slot

Independent iterations used to mask latency. The loop takes 14 cycles per iteration, or 3.5 cycles per original loop body (14/4 = 3.5).

SUBI moved above the last two SDs, so their offsets are adjusted: -16 + 32 = 16 and -24 + 32 = 8.

Loop Unrolling

• When the upper bound isn’t known, we can precondition the loop iteration count.
– Preconditioning loop: replicate the original loop with a trip count of (n mod F), where n is the trip count and F is the unroll factor.
– Unrolled loop: unroll the original loop by F with a trip count of n/F.
• Code size growth can be significant with unrolling. What about the I-cache?

Summary of Unrolling Example

1) Determine we could move the SD
2) Determine the loop bodies are independent and there is a benefit to unrolling
3) Change register names to avoid name conflicts
4) Eliminate the extra branches, adjust the trip count
5) Determine that loads and stores can be interchanged in the unrolled loop
6) Schedule the code, preserving dependences

Dependences

• The key to the unrolling example was finding independent instructions that can execute in parallel.

• Dependences constrain ILP

• Dependence types
– Data (dependence on values)
– Name (conflicts on names)
– Control (dynamic data flow dependent on branches)

Data Dependences

• Instruction j is dependent on i if:
i → j, or transitively i → k and k → j
(“→” is the producer-consumer relation)
• This is a “true dependence”

Loop: LD F0,0(R1)
      ADDD F4,F0,F2
      SD 0(R1),F4
      ...
      SUBI R1,R1,8
      BNEZ R1,Loop


Data Dependent Instructions

• Can’t fully overlap data-dependent instructions, due to the producer-consumer relationship (RAW hazards).

• Dependences arise between registers or between memory locations

LD A
C = A + B
ST C
...
LD C        ; data dependence through
D = C * F   ; memory location C
ST D

Data Dependent Instructions

• Data dependences: a property of programs!

• Detecting and handling hazards: a property of the pipeline!
– E.g., a data dependence means a hazard is possible; the actual behavior of the hazard (length, detection) is pipeline-specific.

Overcoming Data Dependences

• Maintain the dependence but remove (or reduce the impact of) the hazard

• Eliminate dependences by compiler scheduling

Loop: ...
      SUBI R1,R1,8
      LD F6,0(R1)
      ADDD F8,F6,F2
      SD 0(R1),F8
      SUBI R1,R1,8
      ...

In this case, the data dependences cause execution to be serialized.

Page 15
Overcoming Data Dependences

• Maintain the dependence but remove (or reduce the impact of) the hazard

• Eliminate dependences by compiler scheduling

Loop: ...
      SUBI R1,R1,8
      LD F6,-8(R1)
      ADDD F8,F6,F2
      SD -8(R1),F8
      SUBI R1,R1,8
      ...

The compiler folds the offset computation of the LD and SD and the iteration-count decrement into one SUBI; the offsets are adjusted by 8.

Overcoming Data Dependences

• Eliminating data dependences requires knowledge of program structure (data-flow analysis)

• Hence, the compiler applies code transformations to eliminate the dependences

• Hardware (and the compiler) can reduce the impact of data dependences by mitigating the effect of the hazard

Name Dependences

• Antidependence: j writes a location that i reads (a WAR hazard arises if j’s write finishes before i’s read)

• Output dependence: i and j write the same location (WAW hazard)

• Not “true dependences”, because they represent conflicts over names and not actual data flow.

Overcoming Name Dependences

• Name dependences represent conflicts for the same name

• Eliminate them by changing the names!

• For registers: “register renaming”, done by the compiler or by hardware

Renaming in the Unroll Example
Loop: LD F0,0(R1)
      ADDD F4,F0,F2
      SD 0(R1),F4
      LD F0,-8(R1)
      ADDD F4,F0,F2
      SD -8(R1),F4
      LD F0,-16(R1)
      ADDD F4,F0,F2
      SD -16(R1),F4
      LD F0,-24(R1)
      ADDD F4,F0,F2
      SD -24(R1),F4
      SUBI R1,R1,32
      BNEZ R1,Loop

Where are the name dependences in this loop?

Renaming in the Unroll Example
Loop: LD F0,0(R1)
      ADDD F4,F0,F2
      SD 0(R1),F4
      LD F0,-8(R1)
      ADDD F4,F0,F2
      SD -8(R1),F4
      LD F0,-16(R1)
      ADDD F4,F0,F2
      SD -16(R1),F4
      LD F0,-24(R1)
      ADDD F4,F0,F2
      SD -24(R1),F4
      SUBI R1,R1,32
      BNEZ R1,Loop

The name dependences force the loop to be mostly scheduled in a serial fashion. Eliminate them by register renaming. (Each reuse of F0 or F4 as a destination is an output dependence; each read of F0 or F4 before the next body overwrites it is an antidependence.)

Renaming in the Unroll Example
Loop: LD F0,0(R1)
      ADDD F4,F0,F2
      SD 0(R1),F4
      LD F6,-8(R1)
      ADDD F8,F6,F2
      SD -8(R1),F8
      LD F10,-16(R1)
      ADDD F12,F10,F2
      SD -16(R1),F12
      LD F14,-24(R1)
      ADDD F16,F14,F2
      SD -24(R1),F16
      SUBI R1,R1,32
      BNEZ R1,Loop

With the registers renamed, only the true dependences constrain the scheduling of the loop. Each loop body can be overlapped with the others.

Control Dependences

• Determine the ordering of instructions with respect to branches

• All instructions are dependent on some branch

if (cond)
    A;
if (cond2)
    B;
C;
return;

Here A, B, and C are completely independent; if C depended on A or B, the dependence graph would need an edge (A,C) or (B,C).

Control Constraints

• If i is dependent on branch B, then i can’t be moved above B (otherwise its execution would no longer be controlled by B)

• If i is not dependent on branch B, then i can’t be moved after B (otherwise its execution would become dependent on B)

Loop: …
      BEQZ R1,exit
      LD F6,0(R1)
      ADDD F8,F6,F2
      SD 0(R1),F8
      SUBI R1,R1,8
      BEQZ R1,exit

The presence of the BEQZ branches prevents moving instructions across them because of the control dependences. But when the compiler can determine that the trip count is a multiple of the unroll factor, it knows the intervening branches won’t be taken, and the control dependences are reduced.

Overcoming Control Constraints

• DLX pipeline: control dependences preserved by
– executing in order
– not executing an instruction before the branch outcome is known

• To improve performance, we may violate control dependences
– as long as program correctness is maintained!
– preserve exception behavior and data flow
– Exceptions: “no new exceptions” caused by code motion above a branch.

Overcoming Control Constraints

• Exceptions
BEQZ R2,L1
LW R1,0(R2)
L1: …

Hoisting the LW above the branch may cause a new exception, since the BEQZ guards the access through R2.

Overcoming Control Constraints

• Exceptions
BEQZ R2,L1
LW R1,0(R2)
L1: …

• Data flow
ADD R1,R2,R3
BEQZ R4,L0
SUB R1,R5,R6
L0: OR R7,R1,R8

If the branch is taken, R1 comes from the ADD; if not taken, R1 comes from the SUB. Moving the SUB would preserve OR’s data dependences yet change the value flowing into OR, so the data flow must be preserved.

• OR is data dependent on ADD and SUB but control dependent on BEQZ

Overcoming Control Constraints

• Sometimes it’s OK
ADD R1,R2,R3
BEQZ R12,L0
SUB R4,R5,R6
ADD R5,R4,R9
L0: OR R7,R8,R9      ; suppose R4 is dead here

If R4 is dead after L0, then we can move the SUB above the branch.

• R4 is dead, so the data flow is not affected, and the SUB won’t cause an exception.

• It may also be OK if we can “fix up” the data flow when we went the “wrong way”: path-sensitive optimizations (speculation) try to do this.

Loop Carried Dependences

• A parallelizable loop has no cyclic dependences across successive iterations of the loop

• Loop-carried dependence: exists across more than one iteration of the loop

for (i = 0; i < 100; i++)
    A[i+1] = A[i] + C[i] + S;

• A[i+1] depends on the value of A[i] from the previous loop iteration.

Loop Carried Dependences


for (I=1; I<=100; I=I+1) {
A[I] = A[I] + B[I];
B[I+1] = C[I] + D[I];
}

• Can we parallelize this loop?

54

Loop Carried Dependences
for (I=1; I<=100; I=I+1) {
    A[I] = A[I] + B[I];      /* S1 */
    B[I+1] = C[I] + D[I];    /* S2 */
}

• Loop-carried dependence involving B[]:
– S1 is dependent on S2 from the previous iteration.
– S2 is not dependent on S1.

• Hence, we can parallelize this loop (no cycles).

Parallelized Loop
A[1] = A[1] + B[1];
for (I = 1; I <= 99; I += 1) {
    B[I+1] = C[I] + D[I];
    A[I+1] = A[I+1] + B[I+1];
}
B[101] = C[100] + D[100];

• Loop-carried dependence removed: it is now within an iteration rather than between iterations.

• Statements within an iteration must still obey the true dependence on B[I+1].
