Chapter 3: Instruction-Level Parallelism



Instruction-Level Parallelism

• Straightforward pipelining overlaps the execution of multiple independent instructions (processing overlapped across pipeline stages).

• Instruction-level parallelism: exploiting parallelism among instructions to improve performance.

• Generally invisible to the programmer (but not to the compiler!)

Improving Performance

Pipeline CPI
Effective CPI = Ideal CPI + Structural stalls + RAW stalls + WAR stalls + Control stalls

To improve performance, we can tackle any one of these terms:

Control stalls: loop unrolling, dynamic branch prediction, speculation
RAW stalls: scheduling, scoreboarding, memory disambiguation
WAR stalls: scheduling with register renaming
Data stalls: dependence analysis, software pipelining, speculation
Ideal CPI: multiple issue, dependence analysis, software pipelining

Basic Unit of Instruction-Level Parallelism

• Sequence at instruction level is a basic block.
• Basic block (BB)
– Straight-line code with no branches into the sequence except at the top and no branches out except at the bottom.
• Procedure represented as a control flow graph whose nodes are basic blocks

Code Sequence
if (x < y)
    A
else
    B
C


Basic Blocks

L0: ADD r1,r2,r3
    SUB r4,r1,r5
    BEQZ r4,L0
    ADD r3,r2,r1
    BNEZ r3,L1
    OR r8,r10,r11
L1: AND r5,r6,r7
    OR r3,r2,r1
    BEQZ r5,L2

Basic Blocks

L0: ADD r1,r2,r3    ; BB #1
    SUB r4,r1,r5
    BEQZ r4,L0
    ADD r3,r2,r1    ; BB #2
    BNEZ r3,L1
    OR r8,r10,r11   ; BB #3
L1: AND r5,r6,r7    ; BB #4
    OR r3,r2,r1
    BEQZ r5,L2

Control Flow Graph

L0: ADD r1,r2,r3    ; BB #1
    SUB r4,r1,r5
    BEQZ r4,L0
    ADD r3,r2,r1    ; BB #2
    BNEZ r3,L1
    OR r8,r10,r11   ; BB #3
L1: AND r5,r6,r7    ; BB #4
    OR r3,r2,r1
    BEQZ r5,L2

Parallelism in Basic Blocks

• Amount of parallelism within a BB is relatively small.
– Average BB size is 4-6 instructions
– Parallelism is smaller still because the instructions are typically interdependent
• Consider branches as 20% of the instruction mix. How big is the typical basic block?
20% → 1 in 5 instructions is a branch → about 5 instructions

Exploiting Parallelism

• A BB is too small and its instructions are interdependent.
• Exploit parallelism across basic blocks
• ILP is frequently found in loops, across iterations of the loop. E.g.:

for (I=1; I<=1000; I+=1)
    x[I] = x[I] + y[I];

• Every iteration is independent and can be fully overlapped with every other one

Loop Unrolling

• Keep the pipeline full with independent instructions that can execute simultaneously
– Operation latency
– Finding independent instructions
• Expose parallelism across loop iterations
– Compiler pipeline scheduling
– Hardware dynamic scheduling

Loop Unrolling Example

• DLX pipeline with latencies

FP ALU op → FP ALU op        3 cycles
FP ALU op → Store double     2 cycles
Load double → FP ALU op      1 cycle
Load double → Store double   0 cycles

• Simple loop (array x and scalar s are doubles)

for (I = 1; I <= 1000; I++)
    x[I] = x[I] + s;

Straightforward DLX for Example
Loop: LD F0,0(R1) ; F0=array elem
ADDD F4,F0,F2 ; add scalar S
SD 0(R1),F4 ; store X[I]
SUBI R1,R1,8 ; decrement ptr
; double=8bytes
BNEZ R1,Loop ; branch R1!=0

• R1 holds the address of the array (highest element), F2 contains the scalar s, and &x[1] = 8.

• How fast does this code run on DLX without scheduling?

Latency of Straightforward
Example
Loop: LD F0,0(R1) ; F0=array elem 1
stall 2
ADDD F4,F0,F2 ; add scalar S 3
stall 4
stall 5
SD 0(R1),F4 ; store X[I] 6
SUBI R1,R1,8 ; decrement ptr 7
stall 8
BNEZ R1,Loop ; branch R1!=0 9
stall 10

• Total of 5 stall cycles (10 cycles per iteration)

Latency of Scheduled Example
Loop: LD F0,0(R1) ; F0=array elem 1
stall 2
SUBI R1,R1,8 ; decrement ptr 3
ADDD F4,F0,F2 ; add scalar S 4
BNEZ R1,Loop ; branch R1!=0 5
SD 8(R1),F4 ; store X[I] 6

• Interchanged SUBI and SD: the compiler has to notice that the SD offset changes (0 becomes 8) when SD is moved past the SUBI
• SUBI and BNEZ fill the delay of the ADDD, and SD sits in the branch delay slot

Scheduled Example

• Computation: 3 cycles (LD, ADDD, SD)
• Overhead: 3 cycles (SUBI, BNEZ, and one stall)
• 50% of the cycles in each iteration are spent on overhead

• Unroll the loop to lower branch overhead: replicate the loop body several times and remove the extraneous branches

• May also expose more scheduling opportunities, i.e. chances to overlap independent instructions.

Unrolled Example
Loop: LD F0,0(R1)        ; 2 cycles (LD + 1 load stall)
      ADDD F4,F0,F2      ; 4 cycles (ADDD + 2 stalls + SD)
      SD 0(R1),F4
      LD F6,-8(R1)       ; 2
      ADDD F8,F6,F2      ; 4
      SD -8(R1),F8
      LD F10,-16(R1)     ; 2
      ADDD F12,F10,F2    ; 4
      SD -16(R1),F12
      LD F14,-24(R1)     ; 2
      ADDD F16,F14,F2    ; 4
      SD -24(R1),F16
      SUBI R1,R1,32      ; 2 cycles (SUBI + 1 stall)
      BNEZ R1,Loop       ; 2 cycles (BNEZ + delay slot)

Unroll factor is 4, assuming the trip count is a multiple of 4 (R1 drops by 32 bytes per pass). Register names changed to avoid dependences; offsets adjusted for the change in the iteration count. Loop evaluation eliminated for 3 of every 4 iterations (removes 6 instructions).

The loop takes 28 cycles per iteration, or 7 cycles (28/4 = 7) per original loop body. Branches removed, so multiple bodies can be scheduled together.

Scheduled Unrolled Example
Loop: LD F0,0(R1)
      LD F6,-8(R1)
      LD F10,-16(R1)
      LD F14,-24(R1)
      ADDD F4,F0,F2
      ADDD F8,F6,F2
      ADDD F12,F10,F2
      ADDD F16,F14,F2
      SD 0(R1),F4
      SD -8(R1),F8
      SUBI R1,R1,32
      SD 16(R1),F12
      BNEZ R1,Loop
      SD 8(R1),F16       ; branch delay slot

Independent iterations used to mask latency. The loop takes 14 cycles per iteration, or 3.5 cycles per original loop body (14/4 = 3.5).

SUBI moved above the last two SDs, so their offsets are adjusted: -16 + 32 = 16 and -24 + 32 = 8.

Loop Unrolling

• When the upper bound isn’t known, we can precondition the loop iteration count.
– Preconditioning loop: replicate the original loop with a trip count of (n mod F), where n is the trip count and F is the unroll factor.
– Unrolled loop: unroll the original loop by F with a trip count of n/F.
• Code size growth can be significant with unrolling. What about the I-cache?

Summary of Unrolling Example

1) Determine we could move the SD
2) Determine the loop bodies are independent and there is a benefit to unrolling
3) Change register names to avoid name conflicts
4) Eliminate the extra branches, adjust the trip count
5) Determine that loads and stores can be interchanged in the unrolled loop
6) Schedule the code, preserving dependences

Dependences

• The key to the unrolling example was finding independent instructions that can execute in parallel.

• Dependences constrain ILP

• Dependence types
– Data (dependence on values)
– Name (conflicts on names)
– Control (dynamic data flow dependent on branches)

Data Dependences

• Instruction j is dependent on i if:
i → j, or transitively i → k and k → j
(“→” is the producer-consumer relation)
• This is a “true dependence”

Loop: LD F0,0(R1)
      ADDD F4,F0,F2
      SD 0(R1),F4
      ...
      SUBI R1,R1,8
      BNEZ R1,Loop


Data Dependent Instructions

• Can’t fully overlap data-dependent instructions, due to the producer-consumer relationship (RAW hazards).

• Dependences arise between registers or between memory locations

LD A
C = A + B
ST C
...
LD C        ; data dependence through
D = C * F   ; memory location C
ST D

Data Dependent Instructions

• Data dependences: a property of programs!

• Detecting and handling hazards: a property of the pipeline!
– E.g., a data dependence means a hazard is possible; the actual behavior of the hazard (length, detection) is pipeline-specific.

Overcoming Data Dependences

• Maintain the dependence but remove (or reduce the impact of) the hazard

• Eliminate dependences by compiler scheduling

Loop: ...
      SUBI R1,R1,8
      LD F6,0(R1)
      ADDD F8,F6,F2
      SD 0(R1),F8
      SUBI R1,R1,8
      ...

In this case, the data dependences cause execution to be serialized.

Page 15
Overcoming Data Dependences

• Maintain the dependence but remove (or reduce the impact of) the hazard

• Eliminate dependences by compiler scheduling

Loop: ...
      SUBI R1,R1,8
      LD F6,-8(R1)
      ADDD F8,F6,F2
      SD -8(R1),F8
      SUBI R1,R1,8
      ...

The compiler folds the offset computation of the LD and SD and the iteration-count decrement into one SUBI; the offsets are adjusted by 8.

Overcoming Data Dependences

• Eliminating data dependences requires knowledge of program structure (data-flow analysis)

• Hence, the compiler applies code transformations to eliminate the dependences

• Hardware (and the compiler) can reduce the impact of data dependences by mitigating the effect of the hazard

Name Dependences

• Antidependence: j writes a location that i reads (a WAR hazard arises if j’s write finishes before i’s read)

• Output dependence: i and j write the same location (WAW hazard)

• Not “true dependences”, because they represent conflicts over names and not actual data flow.

Overcoming Name Dependences

• Name dependences represent conflicts for the same name

• Eliminate them by changing the names!

• For registers: “register renaming”, done by the compiler or by hardware

Renaming in the Unroll Example
Loop: LD F0,0(R1)
      ADDD F4,F0,F2
      SD 0(R1),F4
      LD F0,-8(R1)
      ADDD F4,F0,F2
      SD -8(R1),F4
      LD F0,-16(R1)
      ADDD F4,F0,F2
      SD -16(R1),F4
      LD F0,-24(R1)
      ADDD F4,F0,F2
      SD -24(R1),F4
      SUBI R1,R1,32
      BNEZ R1,Loop

Where are the name dependences in this loop?

Renaming in the Unroll Example
Loop: LD F0,0(R1)
      ADDD F4,F0,F2
      SD 0(R1),F4
      LD F0,-8(R1)
      ADDD F4,F0,F2
      SD -8(R1),F4
      LD F0,-16(R1)
      ADDD F4,F0,F2
      SD -16(R1),F4
      LD F0,-24(R1)
      ADDD F4,F0,F2
      SD -24(R1),F4
      SUBI R1,R1,32
      BNEZ R1,Loop

The name dependences force the loop to be mostly scheduled in a serial fashion. Eliminate them by register renaming. (Each reuse of F0 or F4 as a destination is an output dependence; each read of F0 or F4 before the next body overwrites it is an antidependence.)

Renaming in the Unroll Example
Loop: LD F0,0(R1)
      ADDD F4,F0,F2
      SD 0(R1),F4
      LD F6,-8(R1)
      ADDD F8,F6,F2
      SD -8(R1),F8
      LD F10,-16(R1)
      ADDD F12,F10,F2
      SD -16(R1),F12
      LD F14,-24(R1)
      ADDD F16,F14,F2
      SD -24(R1),F16
      SUBI R1,R1,32
      BNEZ R1,Loop

With the registers renamed, only the true dependences constrain the scheduling of the loop. Each loop body can be overlapped with the others.

Control Dependences

• Determine the ordering of instructions with respect to branches

• All instructions are dependent on some branch

if (cond)
    A;
if (cond2)
    B;
C;
return;

Here A, B, and C are completely independent; if C depended on A or B, the dependence graph would need an edge (A,C) or (B,C).

Control Constraints

• If i is dependent on branch B, then i can’t be moved above B (otherwise its execution would no longer be controlled by B)

• If i is not dependent on branch B, then i can’t be moved after B (otherwise its execution would become dependent on B)

Loop: …
      BEQZ R1,exit
      LD F6,0(R1)
      ADDD F8,F6,F2
      SD 0(R1),F8
      SUBI R1,R1,8
      BEQZ R1,exit

The presence of the BEQZ branches prevents moving instructions across them because of the control dependences. But when the compiler can determine that the trip count is a multiple of the unroll factor, it knows the intervening branches won’t be taken, and the control dependences are reduced.

Overcoming Control Constraints

• DLX pipeline: control dependences preserved by
– executing in order
– not executing an instruction before the branch outcome is known

• To improve performance, we may violate control dependences
– as long as program correctness is maintained!
– preserve exception behavior and data flow
– Exceptions: “no new exceptions” caused by code motion above a branch.

Overcoming Control Constraints

• Exceptions
BEQZ R2,L1
LW R1,0(R2)
L1: …

Hoisting the LW above the branch may cause a new exception, since the BEQZ guards the access through R2.

Overcoming Control Constraints

• Exceptions
BEQZ R2,L1
LW R1,0(R2)
L1: …

• Data flow
ADD R1,R2,R3
BEQZ R4,L0
SUB R1,R5,R6
L0: OR R7,R1,R8

If the branch is taken, R1 comes from the ADD; if not taken, R1 comes from the SUB. Moving the SUB would preserve OR’s data dependences yet change the value flowing into OR, so the data flow must be preserved.

• OR is data dependent on ADD and SUB but control dependent on BEQZ

Overcoming Control Constraints

• Sometimes it’s OK
ADD R1,R2,R3
BEQZ R12,L0
SUB R4,R5,R6
ADD R5,R4,R9
L0: OR R7,R8,R9      ; suppose R4 is dead here

If R4 is dead after L0, then we can move the SUB above the branch.

• R4 is dead, so the data flow is not affected, and the SUB won’t cause an exception.

• It may also be OK if we can “fix up” the data flow when we went the “wrong way”: path-sensitive optimizations (speculation) try to do this.

Loop Carried Dependences

• A parallelizable loop has no cyclic dependences across successive iterations of the loop

• Loop-carried dependence: exists across more than one iteration of the loop

for (i = 0; i < 100; i++)
    A[i+1] = A[i] + C[i] + S;

• A[i+1] depends on the value of A[i] from the previous loop iteration.

Loop Carried Dependences


for (I=1; I<=100; I=I+1) {
A[I] = A[I] + B[I];
B[I+1] = C[I] + D[I];
}

• Can we parallelize this loop?

54

Loop Carried Dependences
for (I=1; I<=100; I=I+1) {
    A[I] = A[I] + B[I];      /* S1 */
    B[I+1] = C[I] + D[I];    /* S2 */
}

• Loop-carried dependence involving B[]:
– S1 is dependent on S2 from the previous iteration.
– S2 is not dependent on S1.

• Hence, we can parallelize this loop (no cycles).

Parallelized Loop
A[1] = A[1] + B[1];
for (I = 1; I <= 99; I += 1) {
    B[I+1] = C[I] + D[I];
    A[I+1] = A[I+1] + B[I+1];
}
B[101] = C[100] + D[100];

• Loop-carried dependence removed: it is now within an iteration rather than between iterations.

• Statements within an iteration must still obey the true dependence on B[I+1].
