Grain Packing & Scheduling (Ch. 2, Hwang)


Steps in Creating a Parallel Program

[Figure: the four steps in creating a parallel program. Decomposition breaks the sequential computation into tasks; Assignment groups the tasks into processes p0-p3 (Partitioning covers these two steps); Orchestration turns the processes into a parallel program; Mapping places the processes onto processors P0-P3.]

4 steps: Decomposition, Assignment, Orchestration, and Mapping (scheduling).
• Done by the programmer or by system software (compiler, runtime, ...).
• The issues are the same either way, so assume the programmer does it all explicitly.

Programming Level

Concurrent Processing
Several issues affect concurrent processing, such as:
• level of concurrency,
• computational granularity,
• time and space complexity,
• communication latencies, and
• scheduling and load balancing.
Programming Level
Concurrent Processing
Independence among segments of a program is
a necessary condition to execute them
concurrently.
In general, two independent segments can be executed in any order without affecting each other; a segment can be a single instruction or a sequence of instructions.
Programming Level
Concurrent Processing
A dependence graph is used to determine the dependence relations among the program segments.
Programming Level
Concurrent Processing
Dependence Graph — A dependence graph is a directed graph G = (N, A) in which the set of nodes N represents the program segments and the set of directed arcs A shows the order of dependence among the segments.
Programming Level
Concurrent Processing
Dependence Graph
• Dependence comes in various forms and kinds:
− Data dependence
− Control dependence
− Resource dependence
Programming Level
Concurrent Processing
Data Dependence: If an instruction uses a value produced by a previous instruction, then the second instruction is data dependent on the first instruction.
Data Dependence comes in different forms:
Programming Level
Data dependence
 Flow dependence: at least one output of S1 is an input of S2 (Read-After-Write: RAW). S1 → S2
 Anti dependence: the output of S2 overlaps with the input of S1 (Write-After-Read: WAR). S1 → S2
 Output dependence: S1 and S2 write to the same location (Write-After-Write: WAW). S1 → S2
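To make the three forms concrete, here is a minimal Python sketch (the variable names are hypothetical, not from the slides) in which each pair of statements exhibits one dependence type:

# Hypothetical statements illustrating the three data dependence forms.
b, c = 2, 3
a = b + c      # S1: writes a, reads b and c
d = a * 2      # S2: reads the a written by S1      -> flow dependence (RAW)

e = d + 1      # S3: reads d
d = 7          # S4: rewrites the d that S3 read    -> anti dependence (WAR)

f = 1          # S5: writes f
f = 2          # S6: writes f again                 -> output dependence (WAW)

print(a, d, e, f)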
Programming Level

Data dependence
I/O dependence: the same file is referred to by both I/O statements.
Unknown dependence: the dependence relation cannot be determined.
Programming Level
Example
Assume the following sequence of instructions:
• S1: R1 ← (A)
• S2: R2 ← (R1) + (R2)
• S3: R1 ← (R3)
• S4: B ← (R1)
Programming Level
Example
[Figure: dependence graph showing the flow, anti, and output dependence arcs among S1-S4]
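These arcs can also be derived mechanically from each statement's read and write sets. The following Python sketch (not from the slides) performs a conservative pairwise check; note that it also reports a flow arc from S1 to S4, which the intervening redefinition of R1 by S3 actually kills.

# Read and write sets of the example statements.
stmts = {
    "S1": {"reads": {"A"},        "writes": {"R1"}},   # S1: R1 <- (A)
    "S2": {"reads": {"R1", "R2"}, "writes": {"R2"}},   # S2: R2 <- (R1)+(R2)
    "S3": {"reads": {"R3"},       "writes": {"R1"}},   # S3: R1 <- (R3)
    "S4": {"reads": {"R1"},       "writes": {"B"}},    # S4: B  <- (R1)
}

order = ["S1", "S2", "S3", "S4"]
for i, si in enumerate(order):
    for sj in order[i + 1:]:
        a, b = stmts[si], stmts[sj]
        if a["writes"] & b["reads"]:
            print(f"{si} -> {sj}  flow (RAW)")
        if a["reads"] & b["writes"]:
            print(f"{si} -> {sj}  anti (WAR)")
        if a["writes"] & b["writes"]:
            print(f"{si} -> {sj}  output (WAW)")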
Programming Level
Control dependence
The order of execution is determined at run time, for example by conditional statements.
Control dependence could also exist between
operations performed in successive iterations of
a loop:

Do I = 1, N
If (A(I-1) = 0) then A(I) = 0
End
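A small Python sketch of this loop (with hypothetical data) shows why the iterations cannot simply run out of order: whether iteration I zeroes A(I) depends on the value left in A(I-1) by the previous iteration, so zeroes cascade forward.

A = [3, 0, 5, 7, 0, 2]        # hypothetical data
for i in range(1, len(A)):
    if A[i - 1] == 0:         # control depends on the previous iteration
        A[i] = 0
print(A)                      # -> [3, 0, 0, 0, 0, 0]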
Programming Level
Control dependence
Control dependence often does not allow
efficient exploitation of parallelism.
Programming Level
Resource dependence
Conflict in the use of shared resources, such as concurrent requests for the same functional unit.
A resource conflict arises when two instructions attempt to use the same resource at the same time.
Programming Level
Resource dependence
Within the scope of resource dependence we can talk about storage dependence, ALU dependence, etc.
Programming Level
Question
What is “true Dependence”?
What is “False Dependence”?
Programming Level
Concurrent Processing
Bernstein's Conditions
• Let Ii and Oi be the input and output sets of process Pi, respectively. Two processes P1 and P2 can be executed in parallel (P1 || P2) iff:
I1 ∩ O2 = ∅ (WAR)
I2 ∩ O1 = ∅ (RAW)
O1 ∩ O2 = ∅ (WAW)
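A minimal Python sketch of the test (the function name is ours, not from the slides): two processes may run in parallel iff all three intersections are empty.

def bernstein_parallel(I1, O1, I2, O2):
    """True iff P1 || P2 under Bernstein's conditions."""
    return not (I1 & O2) and not (I2 & O1) and not (O1 & O2)

# Example usage with two hypothetical input/output sets:
print(bernstein_parallel({"D", "E"}, {"C"}, {"G", "E"}, {"F"}))   # True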
Programming Level
Concurrent Processing
Bernstein's Conditions
• In general, P1, P2, ..., Pk can be executed in parallel if Bernstein's conditions hold for every pair of processes:

P1 || P2 || P3 ... || Pk iff Pi || Pj ∀ i ≠ j
Programming Level
Concurrent Processing
Bernstein's Conditions
• The parallelism relation (||) is commutative:
Pi || Pj ⇒ Pj || Pi
• The parallelism relation (||) is not transitive:
Pi || Pj and Pj || Pk do not necessarily imply Pi || Pk
Programming Level
Concurrent Processing
Bernstein's Conditions
• Therefore, the parallelism relation (||) is not an equivalence relation.
• The parallelism relation (||) is associative:
Pi || Pj || Pk ⇒ (Pi || Pj) || Pk = Pi || (Pj || Pk)
Programming Level
Concurrent Processing
Bernstein's Conditions — Example
• Detect parallelism in the following program, assume
a uniform execution time:

P1: C = D * E
P2: M = G + C
P3: A = B + C
P4: C = L + M
P5: F = G / E
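A Python sketch (ours, not from the slides) that applies Bernstein's conditions to every pair of these processes:

# Input and output sets of each process.
procs = {
    "P1": ({"D", "E"}, {"C"}),     # C = D * E
    "P2": ({"G", "C"}, {"M"}),     # M = G + C
    "P3": ({"B", "C"}, {"A"}),     # A = B + C
    "P4": ({"L", "M"}, {"C"}),     # C = L + M
    "P5": ({"G", "E"}, {"F"}),     # F = G / E
}

names = list(procs)
for i, p in enumerate(names):
    for q in names[i + 1:]:
        (Ip, Op), (Iq, Oq) = procs[p], procs[q]
        if not (Ip & Oq) and not (Iq & Op) and not (Op & Oq):
            print(p, "||", q)

The check reports P1 || P5, P2 || P3, P2 || P5, P3 || P5, and P4 || P5 as the only parallelizable pairs.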
Programming Level
Concurrent Processing — Bernstein's Condition
[Figure: dataflow graph of P1-P5 assuming unlimited functional units: the multiply (P1) and the divide (P5) execute first, adders +1 (P2) and +2 (P3) follow the multiply, and adder +3 (P4) follows +1.]
Programming Level
Concurrent Processing — Bernstein's Condition
Example: if only two adders are available:
[Figure: the same dataflow graph scheduled with two adders; the three additions now share the two adders.]
Programming Level
Concurrent Processing
Bernstein's Conditions — Example
[Figure: graph of P1-P5 distinguishing data dependence arcs from resource dependence arcs introduced when the processes share functional units.]
Programming Level
Concurrent Processing
Hardware parallelism refers to the type and degree of parallelism defined by the architecture and hardware multiplicity; a k-issue processor is a processor with the hardware capability to issue k instructions per machine cycle.
Programming Level
Concurrent Processing
Software parallelism is defined by the control
and data dependence of programs. It is a
function of algorithm, programming style, and
compiler optimization.
Programming Level
Concurrent Processing
Hardware vs. software parallelism

N = (A * B) + (C * D)
M = (A * B) - (C * D)
Programming Level
Concurrent Processing
Hardware vs. software parallelism
• In machine code we have:
Load R1, A
Load R2, B
Load R3, C
Load R4, D
Mult Rx, R1, R2
Mult Ry, R3, R4
Add R1, Rx, Ry
Sub R2, Rx, Ry
Store N, R1
Store M, R2
Programming Level
Concurrent Processing
Hardware vs. software parallelism
• A machine which allows parallel multiplications, add/subtract, and simultaneous load/store operations gives an average software parallelism of 10/4 = 2.5 instructions per cycle.
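The 2.5 figure is just (number of instructions) / (depth of the dependence graph). A Python sketch of that calculation, with a hypothetical encoding of the ten instructions above:

from functools import lru_cache

# Each instruction maps to the instructions it depends on.
deps = {
    "LdA": [], "LdB": [], "LdC": [], "LdD": [],
    "MulX": ["LdA", "LdB"], "MulY": ["LdC", "LdD"],
    "Add": ["MulX", "MulY"], "Sub": ["MulX", "MulY"],
    "StN": ["Add"], "StM": ["Sub"],
}

@lru_cache(maxsize=None)
def level(op):
    """Earliest cycle in which op can complete on the ideal machine."""
    return 1 + max((level(d) for d in deps[op]), default=0)

depth = max(level(op) for op in deps)   # 4 cycles
print(len(deps) / depth)                # 10 / 4 = 2.5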
Programming Level
Concurrent Processing
Hardware vs. software parallelism
[Figure: dataflow schedule on the ideal machine: the four loads issue in cycle 1, the two multiplies in cycle 2, the add and subtract in cycle 3, and Store N / Store M in cycle 4: 10 instructions in 4 cycles.]
Programming Level
Concurrent Processing
Hardware vs. software parallelism
• For a machine which does not allow simultaneous load/store and arithmetic operations, the same 10 instructions take 8 cycles, giving an average hardware parallelism of 10/8 = 1.25 instructions per cycle.
Programming Level
Concurrent Processing
Hardware vs. software parallelism
[Figure: schedule on the restricted machine: the loads, multiplies, add/subtract, and stores are spread over 8 cycles because memory and arithmetic instructions cannot issue together.]
Programming Level
Concurrent Processing
Hardware vs. software parallelism
• Now assume a multiprocessor composed of two processors:
Programming Level
Concurrent Processing
Hardware vs. software parallelism
[Figure: dual-processor schedule. One processor loads A and B and computes A*B; the other loads C and D and computes C*D. Store K / Store L and Load K / Load L are added instructions for inter-processor communication of the partial products; the add and subtract then produce the results stored to N and M.]
Programming Level
Concurrent Processing
Compilation support is one way to resolve the mismatch between software and hardware parallelism.
A suitable compiler can exploit hardware features in order to improve performance.
Programming Level
Summary
Dependence Graph
Different types of dependencies
Bernstein's Conditions
Hardware Parallelism
Software Parallelism
Programming Level
Concurrent Processing
Detection of concurrency in a program at the instruction level, using techniques like Bernstein's conditions, is not practical, especially for large programs.
So we will look at detection of concurrency at a higher level. This brings us to the issues of partitioning, scheduling, and load balancing.
Programming Level
Concurrent Processing
Two issues of concern:
• How can we partition a program into concurrent
branches, program modules, or grains to yield the
shortest execution time, and
• What is the optimal size of concurrent grains in a
computation?
Programming Level
Concurrent Processing
Partitioning is defined as the ability to partition
a program into subprograms that can be
executed in parallel.

Within the scope of partitioning, two major issues are of concern:
• Grain Size, and
• Latency
Programming Level
Partitioning and Scheduling
Grain Size
• Granularity or grain size is a measure of the amount of computation involved in a process; it determines the basic program segment chosen for parallel processing.
• Grain sizes are commonly classified as:
− fine,
− medium, and
− coarse.
Programming Level
Partitioning and Scheduling
Latency
• Latency imposes a limiting factor on the scalability
of the machine size.
• Communication latency — inter-processor
communication — is a major factor of concern to a
system designer.
Programming Level

Partitioning and Scheduling


• In general, n tasks communicating with each
other may require n (n-1) / 2 communication
links among them.
• This leads to a communication bound which
limits the number of processors allowed in a
computer system.
Programming Level
Partitioning and Scheduling
Parallelism can be exploited at various levels:
• Job or Program
• Subprogram
• Procedure, Task, or Subroutine
• Loops or Iteration
• Instruction or Statement
Programming Level
Partitioning and Scheduling
The lower the level, the finer the granularity.

The finer the granularity, the higher the communication and scheduling overheads.
The finer the granularity, the higher the degree of parallelism.
Programming Level
Partitioning and Scheduling
Instruction and loop levels represent fine grain size.
Procedure and subprogram levels represent medium grain size, and
Job and program levels represent coarse grain size.
Programming Level
Partitioning and Scheduling
Instruction Level
• An instruction level granularity represents a grain
size consisting of up to 20 instructions.

• This level offers a high degree of parallelism in common programs.
• It is expected that parallelism at this level will be exploited automatically by the compiler.
Programming Level
Partitioning and Scheduling
Loop Level
• Here we are typically concerned with iterative loops of fewer than 500 instructions.
• At this level, one can distinguish two classes of
loop:
− Loops with independent iterations and,
− Loops with dependent iterations.
Programming Level
Partitioning and Scheduling
Procedure Level
• A typical grain at this level contains less than 2,000
instructions.
• The communication requirement and penalty at this level are lower than at the fine-grain levels, at the expense of more complexity in detecting parallelism (inter-procedural dependence).
Programming Level
Partitioning and Scheduling
Subprogram Level
• Multiprogramming on a uni-processor or on a
multiprocessor platform represents this level.

• In the past, parallelism at this level was exploited by programmers or algorithm designers rather than by compilers.
Programming Level
Partitioning and Scheduling
Job Level
• This level corresponds to the parallel execution of
independent jobs (programs) on concurrent
computers.
• Supercomputers with a small number of powerful
processors are the best platform for this level of
parallelism. In general, parallelism at this level is
exploitable by the program loader and the operating
system.
Programming Level
Grain Packing
Let us look at the following example to motivate the effect of grain size on performance.
In the following discussion, we make reference to the term program graph.
Programming Level
Grain Packing
A program graph is a dependence graph in
which:
• Each operation is labeled as (n, s) where n is the
node identifier and s is the execution time of the
node.
• Each edge is labeled as (v, d) where v is the edge
identifier and d is the communication delay.
Consider the following program graph:
Programming Level

Grain Packing — Fine grain size
[Figure: fine-grain program graph. Nodes 1-6 have execution time 1 and feed the rest of the graph through edges a-f, each with communication delay 6; nodes 7-17 are CPU operations with execution time 2, connected by edges g-k (delay 4), l, m, o, p (delay 3), and n (delay 4).]
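As a data structure, such a program graph is simply a weighted DAG. Here is a Python sketch using the (n, s) / (v, d) labelling just defined; the handful of nodes, edges, and their connectivity below are illustrative, not the full graph of the figure:

nodes = {1: 1, 2: 1, 7: 2, 13: 2}            # node id -> execution time s
edges = {("a", 1, 7): 6,                     # (edge id, src, dst) -> delay d
         ("b", 2, 7): 6,
         ("g", 7, 13): 4}

def path_time(path):
    """Execution plus communication time along a chain of node ids."""
    total = sum(nodes[n] for n in path)
    for u, v in zip(path, path[1:]):
        total += next(d for (_, s, t), d in edges.items() if (s, t) == (u, v))
    return total

print(path_time([1, 7, 13]))   # 1 + 6 + 2 + 4 + 2 = 15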
Programming Level

Grain Packing — Fine grain size
Nodes 1-6 are all memory reference operations:
• 1 cycle to calculate the address, and
• 6 cycles to fetch data from memory.
The other nodes are CPU operations requiring 2 cycles each.
Programming Level

Grain Packing — Fine grain size
The idea behind grain packing is to:
• apply fine-grain partitioning first, to achieve a higher degree of parallelism, and
• combine multiple fine-grain nodes into a coarse-grain node whenever this eliminates unnecessary communication delays or reduces the overall cost (see the sketch after this list).
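A small Python sketch of the packing step (node names and values are hypothetical): once fine-grain nodes are placed in the same coarse grain, the delay on edges between them is no longer paid, and only edges crossing the grain boundary still cost communication.

exec_time = {"n1": 1, "n2": 1, "n7": 2}        # fine-grain node times
comm = {("n1", "n7"): 6, ("n2", "n7"): 6}      # inter-node communication delays

def grain_cost(members):
    """Sequential execution time of a packed grain (no internal comm)."""
    return sum(exec_time[n] for n in members)

def external_comm(members):
    """Delay still paid on edges that cross the grain boundary."""
    return sum(d for (u, v), d in comm.items()
               if (u in members) != (v in members))

packed = {"n1", "n2", "n7"}
print(grain_cost(packed), external_comm(packed))   # 4 0: both delays eliminated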
Programming Level
Grain Packing — Partitioning
[Figure: the fine-grain graph partitioned into coarse grains: grain A packs nodes 1-6, grain B packs nodes 7-8, grain C packs nodes 9-11, and grains D and E pack the remaining CPU nodes 12-17.]
Programming Level
Grain Packing — Coarse Grain
[Figure: the resulting coarse-grain program graph. Grain A has execution time 8, grains B and C time 4, and grains D and E time 6; only the communication delays that still cross grain boundaries remain.]
Programming Level

Grain Packing — Scheduling Fine Grain Size
[Figure: two-processor schedule of the fine-grain graph, showing busy, communication, and idle periods; the schedule completes at time 42.]
Programming Level
Grain Packing — Scheduling Coarse Grain Size
[Figure: two-processor schedule of the packed graph (grains A-E); with the intra-grain communication removed, the schedule completes at time 38.]
Programming Level
Grain Packing
As can be noted, through grain packing we were able to reduce the overall execution time (from 42 to 38 time units) by reducing the communication overhead.
The concept of grain packing can be applied recursively.
Programming Level
Task Duplication
Task duplication is another way to reduce the
communication overhead and hence the
execution time.

Consider the following program graph:


Programming Level
Task Duplication
[Figure: program graph before duplication. Node A (time 4) feeds B (time 1) and C (time 1) through edges labeled a with delays 1 and 8; B and C feed D (time 2) and E (time 2) through edges b (delay 1) and c (delays 1 and 8); D and E produce their outputs over edges d and e (delay 4).]
Programming Level
Task Duplication
[Figure: two-processor schedule without duplication. P1 runs A, B, and D while P2 runs C and E; waiting for the values communicated over the long a and c edges stretches the schedule to 27 time units.]
Programming Level
Task Duplication
Now let us duplicate tasks A and C:
[Figure: program graph after duplication. A' and C' are local copies of A and C, so every consumer reads its producer over a short (delay 1) edge instead of a long (delay 8) one.]
Programming Level
Task Duplication
[Figure: two-processor schedule with duplication. P1 runs A, B, C, and D while P2 runs A', C', and E; the schedule now completes at time 14.]
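The arithmetic behind the improvement can be sketched as follows (times taken from the example: A runs for 4 units, and its result costs 8 units to communicate to the other processor). Duplication pays off whenever re-executing the producer locally is cheaper than waiting for its message.

exec_A, comm_a = 4, 8               # execution time of A, delay of the long a edge

wait_for_message = exec_A + comm_a  # A finishes remotely, then 8 units on the wire
duplicate_locally = exec_A          # run a private copy A' instead

print(wait_for_message, duplicate_locally)   # 12 vs 4: duplicating A wins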


Programming Level
Summary
In partitioning a task into subtasks two issues
must be taken into consideration:
• Grain Size, and
• Latency
Program Graph
Grain Packing
Task duplication
Programming Level
Loop Scheduling
Loops are the largest source of parallelism in a
program. Therefore, there is a need to pay
attention to the partitioning and allocation of
loop iterations among processors in a parallel
environment.
Programming Level
Loop Scheduling
Practically we can talk about three classes of
loops:
• Sequential Loops,
• Doall Loops — vector (parallel) loops,
• Doacross Loops — Loop with intermediate degree
of parallelism.
Programming Level
Loop Scheduling — Doall Loops
Static Scheduling
Dynamic Scheduling
Programming Level
Loop Scheduling — Doall Loops
Static Scheduling schemes assign a fixed
number of iterations to each processor:
• Block Scheduling (Static chunking),
• Cyclic Scheduling
In practice, cyclic scheduling offers a better
load balancing than block scheduling if the
computation performed by each iteration varies
significantly.
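A Python sketch of the two static schemes for N iterations on P processors (the helper names are ours):

def block_schedule(N, P):
    """Block (static chunking): contiguous slices of about N/P iterations."""
    size = (N + P - 1) // P
    return [list(range(p * size, min((p + 1) * size, N))) for p in range(P)]

def cyclic_schedule(N, P):
    """Cyclic: iteration i goes to processor i mod P."""
    return [list(range(p, N, P)) for p in range(P)]

print(block_schedule(10, 3))    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(cyclic_schedule(10, 3))   # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]

When the cost of an iteration varies with its index, the block scheme concentrates the expensive iterations on a few processors, while the cyclic scheme interleaves them.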
Programming Level
Loop Scheduling — Doall Loops
Dynamic scheduling schemes have been proposed to address the imbalanced workload left by static scheduling schemes:
• Self Scheduling Scheme
• Fixed size chunking Scheme
• Guided Self Scheduling Scheme
• Factoring Scheme
• Trapezoid Self Scheduling Scheme
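As one illustration, under Guided Self Scheduling each idle processor grabs roughly (remaining iterations)/P at a time, so chunks start large and shrink toward the end, trading scheduling overhead against load balance. A Python sketch of the resulting chunk sizes (a simplified model, not a full implementation):

import math

def gss_chunks(N, P):
    """Chunk sizes handed out by Guided Self Scheduling for N iterations on P processors."""
    chunks, remaining = [], N
    while remaining > 0:
        chunk = max(1, math.ceil(remaining / P))
        chunks.append(chunk)
        remaining -= chunk
    return chunks

print(gss_chunks(100, 4))   # [25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1]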
Programming Level
Loop Scheduling
Practice has shown that the loss of parallelism from serializing Doacross loops is very significant — Chen & Yew (1991).
There is a need to develop good schemes for
the parallel execution of Doacross loops.
Programming Level
Loop Scheduling — Doacross Loops
In a Doacross loop, iterations may be either data or control dependent on each other:
• Control dependence is caused by conditional
statements.
• Data dependence appears in the form of sharing
computational results. Data dependence can be
either in the form of lexically-forward or lexically-
backward.
Programming Level
Loop Scheduling — Doacross Loops
Doacross loops can be either:
• regular (dependence distance is fixed) or
• irregular (dependence distance varies from iteration
to iteration).

Regular Doacross loops are more amenable to parallel execution than irregular loops.
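A Python sketch of a regular Doacross loop with a fixed dependence distance d = 2 (hypothetical data): iteration i consumes the value produced by iteration i - d, so on a parallel machine each iteration would wait on a synchronization flag posted by iteration i - d, while up to d iterations can be in flight at once.

N, d = 10, 2
A = list(range(N))                 # hypothetical data
for i in range(d, N):
    A[i] = A[i - d] + 1            # value produced by iteration i - d
print(A)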
Programming Level
Loop Scheduling — Doacross Loops
DOACROSS Model — Cytron 1986
Pre-synchronization Scheduling — Krothapalli
& Sadayappan 1990
Staggered Distribution Scheme — Lim et al., 1992
Programming Level
Loop Scheduling
DOACROSS Model — Example
[Figure: time chart of iterations i1-i8 executed under the DOACROSS model on processors P1-P3; successive iterations start after a fixed delay relative to their predecessors, overlapping their execution in a pipelined fashion.]
Programming Level
Loop Scheduling
The DOACROSS model was aimed at modeling the execution of sequential loops, vector loops, and loops with intermediate parallelism by considering the:
• control dependencies, and
• data dependencies.
Programming Level
Loop Scheduling
Control dependencies are caused by conditional
and unconditional branches.
Data dependencies are due to the lexically
forward and lexically backward dependencies.
This simple model, however, does not consider
the effect of inter-processor communication
cost.
Programming Level
Partitioning and Scheduling
A proper allocation scheme should partition a task (program graph) into subtasks (sub-graphs) and distribute the subtasks among processors in an attempt to minimize:
• processor contentions, and
• inter-processor communications.
Programming Level
Partitioning and Scheduling
There are two main allocation schemes:
• Static, and
• Dynamic.
