Assignments Across Time in Distributed Processor: Algorithm Optimal

Download as pdf or txt
Download as pdf or txt
You are on page 1of 7

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-7, NO.

6, NOVEMBER 1981 583

A Shortest Tree Algorithm for Optimal Assignments


Across Space and Time in a Distributed
Processor System
SHAHID H. BOKHARI, MEMBER, IEEE

Abstract-The problem of optimally assigning the modules of a pro- Research by Stone [51, [7] and Bokhari [1], [2] has shown
gram over the processors of an inhomogeneous distributed processor how the optimal assignment may be found efficiently for the
system is analyzed. The objective is to assign modules, wherever pos- case of dual processor systems using a network flow algorithm.
sible, to the processors on which they execute most rapidly while taking
into account the overhead of interprocessor communication. Factors While an extension to three processors was developed by Stone
contributing to the cost of an assignment are 1) the amount of compu- [6], algorithms for four or more processors have not been
tation required by each module, 2) the amount of data transmitted be- found. Gursky [3] has shown that the problem of finding the
tween each pair of modules, 3) the speed of each processor, and 4) the optimal assignment for four or more processors isNP-complete.
speed of the communication link between each pair of processors. In Section II of this paper we show that if the intermodule
A shortest tree algorithm is described that minimizes the sum of exe-
cution and communication costs for arbitrarily connected distributed
communication pattern of the program is constrained to be
systems with arbitrary numbers of processors, provided the intercon- a tree, then the problem may be solved for an arbitrary num-
nection pattern of the modules forms a tree. The algorithm uses a ber of processors using an efficient dynamic programming
dynamic programming approach to solve the problem for m modules approach. Programs that have a tree-like structure form an
and n processors in O(mn2) time. important class and include programs written as a hierarchy
It is shown that this algorithm may also be used to optimally schedule of subroutines. Turner [8] has discussed how a tree-like
a number of tasks whose precedence relationship forms a tree over a set
of processors whose costs of execution and intercommunication vary structure is best suited for large modular programs.
with time. The motivation in this case is to distribute the task over the The algorithm takes into account the interconnection struc-
processors, delaying the execution of tasks wherever deadlines permit ture of the distributed system (i.e., the speeds of links between
so as to take advantage of periods of light loading of specific processors pairs of processors) a consideration that does not arise in the
and communication links.
In this case the algorithm may be used to minimize the sum of execu- dual processor case.
tion costs (processor and time dependent), communication costs (de- The algorithm minimizes the sum of module execution and
pendent on the characteristics of specific links and on time), and the interprocessor communication costs. The two costs must be
penalties for not meeting deadlines (if any). expressed in the same units which could be time or dollars or
other resource units. If the costs are expressed in dollars, then
Index Terms-Computer networks, distributed processing, precedence
trees, program assignment, scheduling, shortest trees.
the algorithm will minimize the total financial cost of exe-
cuting the program. If expressed in terms of module execu-
tion and interprocessor communication time it will minimize
I. INTRODUCTION the total execution time assuming serial execution of the pro-
OVER the past few years interests in distributed process- gram. That is, even though there are several modules and
ing has led to the identification of several challenging processors, only one module is active on one processor at a
problems. One of these is the problem of optimally assigning time. This is the case in some distributed processing applica-
the modules of a modular program over the processors of an tions including the distributed processor at Brown University
inhomogeneous distributed processor system. Factors con- [4], [9] which has served as a model for prior research by
tributing to the cost of an assignment are 1) the amount of Stone and Bokhari.
computation required by each module, 2) the amount of data In Section III of this paper we show how the dynamic pro-
transmitted between each pair of modules, 3) the speed of gramming approach may also be used to optimally schedule
each processor, and 4) the speed of the communication link a number of tasks whose precedence relationship forms a tree
between each pair of processors. over a set of processors whose costs of execution and inter-
communication vary with time. The motivation in this case
Manuscript received July 5, 1979; revised January 14, 1981. This is to distribute the tasks over the processors, delaying their
work was supported by the National Science Foundation under Grant execution wherever deadlines permit so as to take advantage
MCS-76-11650 while the author was at the Department of Electrical of periods of light loading of specific processors and communi-
and Computer Engineering, University of Massachusetts, Amherst,
and by NASA under Contract NAS-1-14101 while the author was at cation links.
the Institute for Computer Applications in Science and Engineering In this case the algorithm may be used to minimize the sum
(ICASE), NASA Langley Research Center, Hampton, VA. of execution costs (processor and time dependent), communi-
The author is with the Department of Electrical Engineering, Univer-
sity of Engineering and Technology, Lahore, Pakistan. cation costs (dependent on the characteristics of specific links

0098-5589/81/1100-0583$00.75 1981 IEEE


584 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-7, NO. 6, NOVEMBER 1981

and on time), and the penalties for not meeting deadlines


(if any).
We assume all processors to be dissimilar. For example,
a specific processor may have a hardware floating-point
unit and thus be able to carry out floating-point operations
with higher speed than a processor without such hardware.
Similarly, some processors may be able to do byte manipula-
tion more efficiently than others.
Although all processors are dissimilar, each module is able,
in general, to execute on any processor. This is possible if all
modules are written in a high-level language and separate
object versions of these modules are available for each pro-
cessor. This is the case in the Brown University System.
Alternatively, the distributed system may be made up of Fig. 1. An invocation tree.
machines which have varying computational power but can
execute essentially the same instruction set. A system based .calls graph. The algorithm for optimal assignments that will
on the PDP-1 1 family of machines is one example. The com- be presented in this section assumes that the calls graph of
plex of computers at NASA Langley Research Center, where the program is a directed tree. We will call this an invocation
there are several CDC machines varying from a CDC 6400 to tree because it describes the way modules invoke other modules
a Cyber 175, is another example. during the execution of the program. Fig. 1 shows an invoca-
tion tree made up of eight modules.
II. DISTRIBUTING ACROSS SPACE Should a module invoke another module that is not coresi-
In this section we examine the problem of optimally distrib- dent with it, this invocation would have to be transmitted over
uting a modular program over the processors of a distributed a communication link and thus incur interprocessor communi-
processing system. We call this the problem of distributing cation cost. This is dependent on the amount of data trans-
across space (i.e., the space of processors). mitted from one module to the other and the cost per bit of
transmission between the two processors on which the modules
A. Formulation of the Problem are resident.
Our distributed processor system is assumed to be made We will denote by di the total amount of data transmitted
up of an arbitrary number n of dissimilar processors. These between the calling module i and the called module j during
processors are assumed to be interconnected in an arbitrary the lifetime of the program. The cost of transmitting an
fashion, with arbitrary link speeds. amount of data di, between processors p and q is given by
The program to be distributed across the processors is con- a function Spq(di;). In its simplest form this function is
sidered to be made up of a number of modules. A module is Spq(dij) = Spq * di1 where Spq is the cost of transmitting a unit
considered to be a portion of a program that can, in general, amount of data between processors p and q.
execute on any processor and could, for example, be a sub- We assume the communication cost function to be sym-
routine or coroutine. There are assumed to be m modules in metric, i.e., Spq = Sqp . This assumption permits us to associate
the distributed program. the sum of all data flow between modules i and j (whether
For each module we have the cost of executing it on each from i to i or j to i) with the direction i to I. Although we see
of the n-processors. The cost of executing module i on pro- no motivation for considering the case Spq Sqp, the algo-
:

cessor j is denoted ei1 and equals the sum of the costs of the rithm described below can easily be modified to handle this
various periods of execution of the module throughout the case (for which separate di's and d1i's will be required.)
lifetime of the program (since, for example, a subroutine is The cost of invoking a coresident module is assumed to be
typically executed several times during a program run). zero, i.e., Spp = 0. Should this assumption not be valid, it too
The eii's for an m module, n processor problem form an can be relaxed to account for nonnegligible intraprocessor
m X n matrix. Since the processors are dissimilar, the cost communication costs.
of executing a module varies from processor to processor, There is a nonzero di1 associated with each directed edge
that is, eij $ eik, in general. A module may be constrained from node i to node j in the invocation tree of our program.
to reside on a subset of available processors (perhaps on only
one processor) by making its execution costs on the comple- B. Minimum Cost Assignment Across Space
mentary subset infinite. We now show how the minimum cost assignment for our
Modules will transfer control to each other at various points modular program may be found. Such an assignment mini-
during the lifetime of the program. If we draw up a directed mizes the sum of execution costs and interprocessor communi-
graph in which each node represents a module and in which catibn costs.
there is an edge from node i to node / if and only if module Given the invocation tree of a modular program, and the
i calls module j during program execution, this would be a execution and interprocessor communication costs, we may
BOKHARI: SHORTEST TREE ALGORITHM FOR OPTIMAL ASSIGNMENTS 585
s
It follows from property 4) that to each assignment of the
m modules to the n processors there corresponds some subset
of nodes of the assignment graph. The subgraph generated
by these nodes plus the source and terminal nodes is called
an assignment tree and has the following properties.
1) It is a tree.
2) It connects the source node s to all terminal nodes
tl, t2,
3) It contains one and only one node from each layer of
the assignment graph.
There is a one to one correspondence between assignment
tl trees and module assignments. Furthermore, the weight of
each assignment tree (i.e., the sum of the weights of all edges
t3
in it) equals the cost of the corresponding assignment. This
follows from property 9) of assignment graphs. The thick
edges in Fig. 2 represent an assignment tree which is shoWn
in isolation in Fig. 3.
To find a minimum cost assignment we need to find the
t2 minimum weight assignment tree in the assignment graph.
Fig. 2. An assignment graph for the invocation tree of Fig. 1 and a This may be done using the dynamic programming approach
three-processor system. Thick edges form an assignment tree. Ovals described in the Appendix in O(mn2) time.
have been drawn around the two forksets.
III. DISTRIBUTION ACROSS SPACE AND TIME
draw up an assignment graph. Fig. 2 shows the assignment A. Motivations
graph for the invocation tree of Fig. 1 and a three-processor In this section we discuss how a set of tasks that must be
system. executed according to a tree-like precedence relationship may
The following definitions apply to this assignment graph. be optimally scheduled over a distributed processor system in
1) The assignment graph is a directed graph with weighted which costs vary with time.
edges. The computational resources of many organizations are in
2) There is one distinguished node called the source node, the form of a distributed computer network with each com-
denoted s. puter servicing one or more high priority local tasks and retain-
3) There are several terminal nodes tl, t2< one for each
, ing the capability of servicing remotely submitted tasks at a
leaf node of the invocation tree. low priority. In a production environment, the loads on the
4) In addition to the source and terminal nodes there are computers are fairly predictable as they depend heavily on
m X n further nodes in the assignment graph (for a problem the specific local loads on the machines.
involving m modules and n processors). Each node is labeled In such an environment we would be interested in distrib-
with a pair of numbers (i,j) and represents the assignment of uting a large set of tasks over the system such that it utilizes
module i to processor j. whatever processors are lightly loaded. Since the loads on the
5) Each layer of the assignment graph corresponds to a node processors are time dependent, we may wish to consider sus-
of the invocation tree. For example, the layer comprising pending some tasks until a time when the load on a particular
nodes (2, 1), (2, 2), and (2, 3) corresponds to node 2 of the processor is very light. Of course, some tasks may be time
invocation tree. critical and not admit any such suspension. Our problem thus
6) Nodes in layers corresponding to nodes in the invocation becornes one of distributing a set of tasks over a set of pro-
tree having outdegree greater than 1 are called forknodes. cessors, taking into account the run cost of each task on each
Each layer of forknodes is called a forkset. processor (which will, in general, be time-dependent), the
The edges have weights on them according to the following intercommunication costs between pairs of tasks (which will
rules. depend on the pairs of processors on which the two tasks are
7) All edges incident on the terminal nodes tl, t2, etc. have executed and may also be time dependent) and the penalties
zero weight on them. for not meeting deadlines for tasks (which may be set to in-
8) Edges joining sourcenode S to nodes (1, 1), (1, 2)," finity if a task must be executed by a certain time).
have weights el1, e12,* As a concrete example of a situation where such scheduling
9) The edge joining node (i, p) to node (j, q) has weight may be useful, consider the setting up of an appointment for
ejq Spq(djj). For example, the weight on the edge joining
+ a student to see a specific doctor at a University Infirmary.
node (1, 3) to (2, 1) is e2l +S13(dl2). This equals the cost This task is to be run at the University's central administrative
of assigning module 2 to processor 1, given that module 1 has computing center (which is made up of one or more time-
been assigned to processor 3. shared machines). The task involves the updating of a specific
586 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-7, NO. 6, NOVEMBER 1981

cated to real-time simulation during the day but may be used


at lower cost during the evenings and at night.

B. Formulation of the Problem


We assume there to be n processors in our distributed system.
The costs of executing tasks on processors vary with time but
remain constant over specified periods of time called phases.
This is illustrated in Fig. 4 where the vertical axis represents
the space of processor and the horizontal axis represents time.
During some phases a processor may be totally unavailable
because of complete dedication to a real-time task or perhaps
because of scheduled maintenance.
Once a task starts executing during a particular phase, it is
allowed to run to completion even if the phase ends during
tl
task execution. The time to execute a task is considered to
e
be small compared to the length of a phase. A task that is
initiated near the end of a phase is treated like a customer
who arrives at a bank just before closing time-he is allowed
to complete his transaction even though closing time is past.
The graph superimposed on Fig. 4 shows one possible way
of scheduling a precedence tree of tasks over the two dimen-
sions of space (processors) and time (phases).
In the usual fashion, each node of the precedence tree repre-
t2
sents a task and a directed edge from node i to node f implies
Fig. 3. The assignment tree from Fig. 2 shown in isolation with edges that task i must be completed before task j is started.
labeled.
With each node i is associated eiik, the cost of executing
task i on processor j during phase k. This cost will, in general,
doctor's appointment file, the scheduling of an examination vary across the processors and phases. It may be set to in-
room, mailing a remainder to the student, updating the stu- finity for some processors during certain phases if these pro-
dent's account file, etc. Some tasks (e.g., updating the doctor's cessors are not available during these specific phases.
appointment file) must be done immediately, while others Also associated with each node i is Fjk, the penalty for not
(updating the student's account) may be deferred until a later completing task i by the end of phase k. This penalty Fik may
time. be set to infinity if the task must finish by phase k.
In an airlines reservation computer system, the prime objec- With the edge connecting nodes i and i in the precedence
tive is to provide fast reliable service to the numerous ticketing tree, we have di-the amount of data that must be transmitted
terminals. However, there is no reason why such a system can- between tasks when invoking task i at the end of task i. The
not be used for other functions as well. Tasks other than overhead of invoking task j in phase /,, on processor py after
ticketing need to be scheduled over this system so as to mini- completing task i in phase 4, on processor p, is a function
mally impact the efficiency of ticketing by deferring as much T(di , p vw). This function can take into account 1)
I

as possible such tasks to off-peak load hours.


the amount of data transmitted, 2) the cost per bit of the link
Another example is where a long engineering computation between px and py, and 3) the overhead of suspending a task
(comprising several steps to be executed in a sequence) is to when it finishes in 0, and resumes in Ow. A simple function,
be carried out at a large computation laboratory with several which ignores 3) above is T = S,y * di1 where S,y is the cost
computers, some of which are used for real-time simulation of transmitting a unit of data between processors px and py .
during the day. An engineer carrying out such a calculation
will often do the initial data preparation during the day and C. Solution
then suspend his task until night time when it may be run The precedence tree of tasks may be scheduled to mini-
at very low cost on a lightly loaded machine. The final in- mize the sum of execution costs, interprocessor communica-
teractive examination of the results is done on the follow- tion costs, penalties for not meeting deadlines, and costs of
ing day. suspending and resuming tasks. The solution techniques are
The complex of computers at NASA Langley Research very similar to the optimal assignment approach of Section II.
Center is an example of this kind of environment. There are Fig. 5 shows a scheduling graph that has m * n q nodes for
-

at this center several CDC Cyber machines with varying com- a problem based on m modules, n processors, and k phases.
putational power but capable of executing essentially the same This is a weighted directed graph. (Fig. 5 omits the arrow
instruction set. Some powerful machines are completely dedi- heads on the edges for clarity-there is no ambiguity of direc-
BOKHARI: SHORTEST TREE ALGORITHM FOR OPTIMAL ASSIGNMENTS 587

SPACE

PROCESSOR 7

PROCESSOR 2

PROCESSOR 1

PHASE 1 PHASE 2 PHASE 3 PHASE 4 PHASE 5 TIME


Fig. 4. Scheduling a precedt~e,nce tree over space and time.
tion since all edges consistently point away from node s.) This a tree-like structure over a distributed processor system and 2)
graph is drawn up according to the following rules. to optimally schedule- a set of tasks that have a tree-like prece-
1) Node (i, j, k) represents the execution of task i on pro- dence relationship over a distributed processor system in
cessor / during phase k. which costs vary with time but are constant over contiguous
2) Edges joining node s to nodes 111, 121, 112, etc. have periods of time called phases.
weights ell,, e12, e112, etc. For the first case the algorithm has O(mn2) complexity
3) Edges incident on nodes tl, t2 ,... have weight zero. where m is the number of modules and n is the number of
4) There is an edge joining node (i, px, f) to node (j, pv, processors. We have thus shown the unsolved problem of
Ow) if node i precedes node j in the precedence tree and if optimally assigning a modular program over more than three
Ow > u. processors to be solvable for the important class of programs
5) The weight on the edge described above is e. Py<w + in which the module interconnection structure is a tree.
T(dJpp,Px, P 1,Ou, OW) + Fi, ,Wl . The first term 'takes into In the second case the algorithm takes O(nn2.2) time where
account the cost of execution, and the second term is the ' is the number of phases.
interprocessor and interphase communication overhead. The This algorithm does not take into account the capacity of
last term represents the penalty for not completing this task each processor which may be a consideration in practice.
by the end of the previous phase.
The similarity between this scheduling graph and the assign- APPENDIX
ment graph of the previous section is obvious. In fact the THE SHORTEST TREE ALGORITHM
An algorithm to find 'the shortest assignment tree in an
scheduling graph- may be considered to be an assignment graph
based on m modules and n * processors. Thus, the shortest assignment graph is presented in this section. At the heart of
this aigorithm is a procedure that will find the shortest paths
tree in the scheduling graph, which would correspond to the
from a terminal node of the assignment graph to all nodes in
optimal schedule, may be found using the algorithm described
in the Appendix in O(mn2q2) time. the nearest forkset (Fig. 6). This may be done using dynamic
programming in O(kn2) time, where k is the number of layers
IV. CONCLUSIONS between the terminal node and the forkset involved. (From
pbrogramming algorithm has been presented that each node in a layer we label all nodes in the preceding
A dynamic

may be u'sed 1) optimally assign modular program that has layer-this takes 0(n2) time. This labeling is repeated k times.)
to a
588 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-7, NO. 6, NOVEMBER 1981

2 3

Fig. 5. A precedence tree and its scheduling graph for a two-processor


two-phase problem.

Let us call this procedure SHORT and assume that it leaves remove f from FSET;
pointers from each node to the next node in the shortest path add tf to TSET;
to the terminal node.
We will call a forkset "exposed" when the shortest paths end;
from its nodes to all possible nodes have been found. end;
find the shortest path from the last terminal node to s;
begin
(* the length of this path equals the weight of the
input graph; shortest tree *)
(* TSET is the set of all terminal nodes *)
reconnect all disconnected edges;
(*FSET is the set of all forksets *) traverse graph from s to all terminal nodes by following
while ITSETI > 1 do pointers set up by procedure SHORT; (* each node
begin encountered is part of the shortest tree *)
to each terminal node t in TSET apply procedure end.
SHORT and remove t from TSET;
Fig. 6 shows an assignment graph just after the application of
for each exposed forkset f in FSET do procedure SHORT to terminal nodes t1 and t2. In Fig. 7 we
begin have temporarily removed the two limbs of the graph and
temporarily disconnect all outgoing edges; created a pseudoterminal node to. The weight on the edge
joining to to a node in the forkset equals the sum of the
create a pseudoterminal node tf; shortest paths from that node to tI and t2 from Fig. 6. After
join all nodes in f to tf with edges that finding the shortest path from s to to we reconnect the two
have weights equal to the sum of the limbs of the graph to obtain the shortest tree as shown in
several shortest paths to the several Fig. 8.
discarded terminal nodes; This algorithm is applicable to arbitrary assignment graphs.
..

BOKHARI: SHORTEST TREE ALGORITHM FOR OPTIMAL ASSIGNMENTS 589

ACKNOWLEDGMENT
The author wishes to thank Prof. H. Stone for his encourage-
ment of this research. Comments of the referees helped to
reshape this paper significantly.
REFERENCES
[1] S. H. Bokhari, "Dual processor scheduling with dynamic reassign-
ment," IEEE Trans. Software Eng., vol. SE-5, pp. 341-349, July
1979.
[2] -, "Optimal assignments in dual processor distributed systems
under varying load conditions," IEEE Trans. Software Eng., to
be published.
[3] M. Gursky, private communication.
[4] J. Michel and A. van Dam, "Experience with distributed processing
on a host/satellite graphics system, in Proc. SIGGRAPH '76; avail-
Fig. 6. Shortest paths from t, and t2 to ali the nodes in the forkset. able as Computer Graphics (SIGGRAPH Newsletter), vol. 10, no.
2, 1976.
[5] H. S. Stone, "Multiprocessor scheduling with the aid of network
flow algorithms," IEEE Trans. Software Eng., vol. SE-3, pp. 85-93,
Jan. 1977.
[6] -, "Program assignment in three-processor systems and tricutset
partitioning of graphs," Dep. Elec. Comput. Eng., Univ. Massa-
chusetts, Amherst, Tech. Rep. ECE-CS-77-7, 1977.
[7] -, "Critical load factors in distributed systems," IEEE Trans.
Software Eng., vol. SE-4, pp. 254-258, May 1978.
[8] J. Turner, "The structure of modular programs," Commun. Ass.
Comput. Mach., vol. 23, pp. 272-277, May 1980.
[9] A. van Dam, G. Stabler, and R. Harrington, "Intelligent satellites
for interactive graphics," Proc. IEEE, vol. 62, pp. 83-92, Apr.
1974.
Fig. 7. Transformed graph with shortest path from pseudoterminal
node to to s.

Shahid H. Bokhari (S'75-M'78) was born in


Lahore, Pakistan, in 1953. He received the
B.Sc. degree in electrical engineering from the
University of Engineering and Technology,
Lahore, Pakistan, in 1974, and the M.S. and
Ph.D. degrees in electrical and computer engi-
neering from the University of Massachusetts,
Amherst, in 1976 and 1978, respectively.
In 1974 he worked as a Research Associate
in the Department of Electrical Engineering
at the Pakistan University of Engineering and
Technology. From 1975 to 1978 he was a Research Assistant at the
tI t2 Department of Electrical and Computer Engineering, University of
Fig. 8. The shortest assignment tree. Massachusetts. He was a Research Scientist at the Institute for Com-
puter Applications in Science and Engineering (ICASE) at NASA
Langley Research Center, Hampton, VA, from July 1978 to December
Each application of procedure SHORT takes O(kn2) time. 1979. He is currently an Assistant Professor in the Department of Elec-
The total time is Li O(kin2), where i represents each applica- trical Engineering, University of Engineering and Technology, Lahore,
tion of procedure SHORT. This equals 0(n2) Li ki = O(mn2) Pakistan. His research interests include computer architecture, dis-
tributed processing, and performance evaluation.
since the total number of layers in the graph is m. Dr. Bokhari is a member of the Association for Computing Machinery.

.. .. k7 _. ..
I
- .,

You might also like