To appear in Journal of Parallel and Distributed Computing, 1994.
Analyzing Scalability of Parallel Algorithms and
Architectures
Vipin Kumar and Anshul Gupta
Department of Computer Science
University of Minnesota
Minneapolis, MN - 55455
[email protected] and
[email protected]
TR 91-18, November 1991 (Revised July 1993)
Abstract
The scalability of a parallel algorithm on a parallel architecture is a measure of its capacity to effectively utilize an increasing number of processors. Scalability analysis may be used to select the best
algorithm-architecture combination for a problem under different constraints on the growth of the problem
size and the number of processors. It may be used to predict the performance of a parallel algorithm and
a parallel architecture for a large number of processors from the known performance on fewer processors.
For a fixed problem size, it may be used to determine the optimal number of processors to be used and
the maximum possible speedup that can be obtained. The objective of this paper is to critically assess
the state of the art in the theory of scalability analysis, and motivate further research on the development of new and more comprehensive analytical tools to study the scalability of parallel algorithms and
architectures. We survey a number of techniques and formalisms that have been developed for studying
scalability issues, and discuss their interrelationships. For example, we derive an important relationship
between time-constrained scaling and the isoefficiency function. We point out some of the weaknesses of
the existing schemes for measuring scalability, and discuss possible ways of extending them.
This work was supported by IST/SDIO through the Army Research Office grant # 28408-MA-SDI to the University
of Minnesota and by the Army High Performance Computing Research Center at the University of Minnesota. An
earlier version of this paper appears in the Proceedings of the 1991 International Conference on Supercomputing,
Cologne, Germany, June 1991. A short version also appeared as an invited paper in the Proceedings of the 29th
Annual Allerton Conference on Communication, Control and Computing, Urbana, IL, October 1991
1 Introduction
At the current state of technology, it is possible to construct parallel computers that employ thousands of processors. Availability of such systems has fueled interest in investigating the performance
of parallel computers containing a large number of processors. While solving a problem in parallel,
it is reasonable to expect a reduction in execution time that is commensurate with the amount of
processing resources employed to solve the problem. The scalability of a parallel algorithm on a
parallel architecture is a measure of its capacity to effectively utilize an increasing number of processors. Scalability analysis of a parallel algorithm-architecture combination can be used for a variety of
purposes. It may be used to select the best algorithm-architecture combination for a problem under
different constraints on the growth of the problem size and the number of processors. It may be used
to predict the performance of a parallel algorithm and a parallel architecture for a large number of
processors from the known performance on fewer processors. For a fixed problem size, it may be
used to determine the optimal number of processors to be used and the maximum possible speedup
that can be obtained. The scalability analysis can also predict the impact of changing hardware
technology on the performance and thus help design better parallel architectures for solving various
problems.
The objective of this paper is to critically assess the state of the art in the theory of scalability
analysis, and to motivate further research on the development of new and more comprehensive
analytical tools to study the scalability of parallel algorithms and architectures. We survey a number
of techniques and formalisms that have been developed for studying scalability issues, and discuss
their interrelationships. We show some interesting relationships between the technique of isoefficiency
analysis [29, 13, 31] and many other methods for scalability analysis. We point out some of the
weaknesses of the existing schemes, and discuss possible ways of extending them.
The organization of this paper is as follows. Section 2 describes the terminology that is followed
in the rest of the paper. Section 3 surveys various metrics that have been proposed for measuring
the scalability of parallel algorithms and parallel architectures. Section 4 reviews the literature on
the performance analysis of parallel systems. Section 5 describes the relationships among the various
scalability metrics discussed in Section 3. Section 6 analyzes the impact of technology dependent
factors on the scalability of parallel systems. Section 7 discusses scalability of parallel systems when
the cost of scaling up a parallel architecture is also taken into account. Section 8 contains concluding
remarks and directions for future research.
2 Definitions and Assumptions
Parallel System: The combination of a parallel architecture and a parallel algorithm implemented
on it. We assume that the parallel computer being used is a homogeneous ensemble of processors; i.e., all processors and communication channels are identical in speed.
Problem Size W: The size of a problem is a measure of the number of basic operations needed
to solve the problem. There can be several different algorithms to solve the same problem. To
keep the problem size unique for a given problem, we define it as the number of basic operations
required by the fastest known sequential algorithm to solve the problem on a single processor.
Problem size is a function of the size of the input. For example, for the problem of computing
an N-point FFT, W = Θ(N log N).
According to our definition, the sequential time complexity of the fastest known serial algorithm
to solve a problem determines the size of the problem. If the time taken by an optimal (or the
fastest known) sequential algorithm to solve a problem of size W on a single processor is TS ,
then TS ∝ W, or TS = tcW, where tc is a machine-dependent constant.
Serial Fraction s: The ratio of the serial component of an algorithm to its execution time on one
processor. The serial component of the algorithm is that part of the algorithm which cannot
be parallelized and has to be executed on a single processor.
Parallel Execution Time TP: The time elapsed from the moment a parallel computation starts
to the moment the last processor finishes execution. For a given parallel system, TP is normally
a function of the problem size (W) and the number of processors (p), and we will sometimes
write it as TP(W, p).
Cost: The cost of a parallel system is defined as the product of the parallel execution time and the
number of processors utilized. A parallel system is said to be cost-optimal if and only if
the cost is asymptotically of the same order of magnitude as the serial execution time (i.e.,
pTP = Θ(W)). Cost is also referred to as the processor-time product.
Speedup S: The ratio of the serial execution time of the fastest known serial algorithm (TS) to
the parallel execution time of the chosen algorithm (TP).
Total Parallel Overhead To: The sum total of all the overhead incurred due to parallel processing
by all the processors. It includes communication costs, non-essential work, and idle time due
to synchronization and the serial components of the algorithm. Mathematically, To = pTP − TS.
In order to simplify the analysis, we assume that To is a non-negative quantity. This implies
that speedup is always bounded by p. Speedup can, however, be superlinear and To can be
negative if the memory is hierarchical and the access time increases (in discrete steps) as the
memory used by the program increases. In this case, the effective computation speed of a large
program will be slower on a serial processor than on a parallel computer employing similar
processors. The reason is that a sequential algorithm using M bytes of memory will use only
M/p bytes on each processor of a p-processor parallel computer. The core results of the paper
are still valid with hierarchical memory, except that the scalability and performance metrics
will have discontinuities, and their expressions will be different in different ranges of problem
sizes. The flat memory assumption helps us to concentrate on the characteristics of the parallel
algorithm and architectures, without getting into the details of a particular machine.
For a given parallel system, To is normally a function of both W and p, and we will often write
it as To(W, p).
Efficiency E: The ratio of speedup (S) to the number of processors (p). Thus, E = S/p = TS/(pTP) = 1/(1 + To/TS).
Degree of Concurrency C(W): The maximum number of tasks that can be executed simultaneously at any given time in the parallel algorithm. Clearly, for a given W, the parallel algorithm
cannot use more than C(W) processors. C(W) depends only on the parallel algorithm, and is
independent of the architecture. For example, for multiplying two N × N matrices using Fox's
parallel matrix multiplication algorithm [12], W = N³ and C(W) = N² = W^(2/3). It is easily
seen that if the processor-time product [1] is Θ(W) (i.e., the algorithm is cost-optimal), then
C(W) ≤ Θ(W).
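As a concrete illustration of the definitions above, the following short Python sketch (not part of the original paper; the timings are hypothetical) computes the overhead, speedup, efficiency, and cost from measured TS and TP:

    # Illustrative sketch (hypothetical numbers): basic metrics from Section 2.
    def parallel_metrics(T_S, T_P, p):
        cost = p * T_P                 # processor-time product
        T_o = p * T_P - T_S            # total parallel overhead, To = p*TP - TS
        S = T_S / T_P                  # speedup
        E = S / p                      # efficiency, E = 1 / (1 + To/TS)
        return {"cost": cost, "overhead": T_o, "speedup": S, "efficiency": E}

    # Example: a problem that takes 100 s serially and 3.5 s on 64 processors.
    print(parallel_metrics(T_S=100.0, T_P=3.5, p=64))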
3 Scalability Metrics for Parallel Systems
It is a well known fact that given a parallel architecture and a problem instance of a fixed size, the
speedup of a parallel algorithm does not continue to increase with increasing number of processors.
The speedup tends to saturate or peak at a certain value. As early as in 1967, Amdahl [2] made the
observation that if s is the serial fraction in an algorithm, then its speedup is bounded by 1/s, no matter
how many processors are used. This statement, now popularly known as Amdahl's Law, has been
used by Amdahl and others to argue against the usefulness of large scale parallel computers. Actually,
in addition to the serial fraction, the speedup obtained by a parallel system depends upon a number
of factors such as the degree of concurrency and overheads due to communication, synchronization,
redundant work etc. For a fixed problem size, the speedup saturates either because the overheads
grow with increasing number of processors or because the number of processors eventually exceeds
the degree of concurrency inherent in the algorithm. In the last decade, there has been a growing
realization that for a variety of parallel systems, given any number of processors p, speedup arbitrarily
close to p can be obtained by simply executing the parallel algorithm on big enough problem instances
[47, 31, 36, 45, 22, 12, 38, 41, 40, 58].
Kumar and Rao [31] developed a scalability metric relating the problem size to the number of
processors necessary for an increase in speedup in proportion to the number of processors. This metric
is known as the isoefficiency function. If a parallel system is used to solve a problem instance of
a fixed size, then the efficiency decreases as p increases. The reason is that To increases with p. For
many parallel systems, if the problem size W is increased on a fixed number of processors, then the
efficiency increases because To grows slower than W. For these parallel systems, the efficiency can be
maintained at some fixed value (between 0 and 1) for increasing p, provided that W is also increased.
We call such systems scalable parallel systems. (For some parallel systems, e.g., some of those
discussed in [48] and [34], the maximum obtainable efficiency Emax is less than 1. Even such parallel
systems are considered scalable if the efficiency can be maintained at a desirable value between 0
and Emax.) This definition of scalable parallel algorithms is similar to the definition of parallel
effective algorithms given by Moler [41].
For different parallel systems, W should be increased at different rates with respect to p in order
to maintain a fixed efficiency. For instance, in some cases, W might need to grow as an exponential
function of p to keep the efficiency from dropping as p increases. Such parallel systems are poorly
scalable. The reason is that on these parallel systems, it is difficult to obtain good speedups for a
large number of processors unless the problem size is enormous. On the other hand, if W needs to
grow only linearly with respect to p, then the parallel system is highly scalable. This is because it
can easily deliver speedups proportional to the number of processors for reasonable problem sizes.
The rate at which W is required to grow w.r.t. p to keep the efficiency fixed can be used as a
measure of scalability of the parallel algorithm for a specific architecture. If W must grow as fE(p)
to maintain an efficiency E, then fE(p) is defined to be the isoefficiency function for efficiency E, and
the plot of fE(p) vs. p is called the isoefficiency curve for efficiency E. Equivalently, if the relation
W = fE(p) defines the isoefficiency curve for a parallel system, then p should not grow faster than
fE⁻¹(W) if an efficiency of at least E is desired.
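Once the overhead To(W, p) of a parallel system is known, the isoefficiency curve can also be obtained numerically. The Python sketch below is an illustration only, under an assumed (hypothetical) overhead model and constants; it finds, for each p, the W that holds the efficiency at a target value E by iterating the relation tcW = (E/(1 − E)) To(W, p):

    # Numerical isoefficiency sketch for an assumed overhead model
    # To(W, p) = ts*p*log2(p) + tw*sqrt(W)*p   (hypothetical).
    import math

    t_c, t_s, t_w = 1.0, 10.0, 2.0

    def T_o(W, p):
        return t_s * p * math.log2(p) + t_w * math.sqrt(W) * p

    def isoefficiency_point(p, E, W0=1.0):
        # Fixed-point iteration on  tc*W = (E/(1-E)) * To(W, p).
        W = W0
        for _ in range(100):
            W = (E / (1.0 - E)) * T_o(W, p) / t_c
        return W

    for p in (4, 16, 64, 256):
        print(p, round(isoefficiency_point(p, E=0.5)))

Plotting the resulting W against p gives the isoefficiency curve for the chosen E.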
In Kumar and Rao's framework, a parallel system is considered scalable if its isoefficiency function
exists; otherwise the parallel system is unscalable. The isoefficiency function of a scalable system
could, however, be arbitrarily large; i.e., it could dictate a very high rate of growth of problem size
w.r.t. the number of processors. In practice, the problem size can be increased asymptotically only
at a rate permitted by the amount of memory available at each processor. If the memory constraint
does not allow the size of the problem to increase at the rate necessary to maintain a fixed efficiency,
then the parallel system should be considered unscalable from a practical point of view.
Isoefficiency analysis has been found to be very useful in characterizing the scalability of a variety
of parallel systems [17, 24, 19, 31, 32, 34, 46, 48, 56, 55, 18, 15, 33, 14, 30]. An important feature of
isoefficiency analysis is that in a single expression, it succinctly captures the effects of characteristics
of a parallel algorithm as well as the parallel architecture on which it is implemented. By performing
isoefficiency analysis, one can test the performance of a parallel program on a few processors, and
then predict its performance on a larger number of processors. For a tutorial introduction to the
isoefficiency function and its applications, one is referred to [29, 13].
Gustafson, Montry and Benner [22, 20] were the first to experimentally demonstrate that by
scaling up the problem size one can obtain near-linear speedup on as many as 1024 processors.
Gustafson et al. introduced a new metric called scaled speedup to evaluate the performance on
practically feasible architectures. This metric is defined as the speedup obtained when the problem
size is increased linearly with the number of processors. If the scaled-speedup curve is good (e.g.,
close to linear w.r.t. the number of processors), then the parallel system is considered scalable. This
metric is related to isoefficiency if the parallel algorithm under consideration has a linear or near-linear
isoefficiency function. In this case the scaled-speedup metric provides results very close to those of
isoefficiency analysis, and the scaled-speedup is linear or near-linear with respect to the number of
processors. For parallel systems with much worse isoefficiencies, the results provided by the two
metrics may be quite different. In this case, the scaled-speedup vs. number of processors curve is
sublinear.
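The difference between fixed-size speedup and scaled speedup can be illustrated with a simple overhead model. The sketch below is a hypothetical example only (the model TP = tcW/p + ts·p and all constants are assumptions made for illustration); it contrasts holding the problem size fixed with growing it linearly with p:

    # Fixed-size vs. scaled speedup under an assumed model
    # TP(W, p) = tc*W/p + ts*p  (hypothetical overhead linear in p per processor).
    t_c, t_s = 1.0, 5.0

    def speedup(W, p):
        T_S = t_c * W
        T_P = t_c * W / p + t_s * p
        return T_S / T_P

    W0 = 10_000
    for p in (1, 16, 64, 256, 1024):
        fixed = speedup(W0, p)          # problem size held at W0
        scaled = speedup(W0 * p, p)     # problem size grown linearly with p
        print(f"p={p:5d}  fixed-size speedup={fixed:8.1f}  scaled speedup={scaled:8.1f}")

Because the assumed overhead gives this system a Θ(p²) isoefficiency function, the scaled-speedup curve is visibly sublinear, in line with the discussion above.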
Two generalized notions of scaled speedup were considered by Gustafson et al. [20], Worley [58]
and Sun and Ni [50]. They differ in the methods by which the problem size is scaled up with the
number of processors. In one method, the size of the problem is increased to fill the available memory
on the parallel computer. The assumption here is that the aggregate memory of the system will increase
with the number of processors. In the other method, the size of the problem grows with p subject
to an upper-bound on execution time. Worley found that for a large class of scientific problems, the
time-constrained speedup curve is very similar to the fixed-problem-size speedup curve. He found
that for many common scientific problems, no more than 50 processors can be used effectively in
current generation multicomputers if the parallel execution time is to be kept fixed. For some
other problems, Worley found time-constrained speedup to be close to linear, thus indicating that
arbitrarily large instances of these problems can be solved in fixed time by simply increasing p.
In [16], Gupta and Kumar identify the classes of parallel systems which yield linear and sublinear
time-constrained speedup curves.
Karp and Flatt [27] introduced the experimentally determined serial fraction f as a new metric for
measuring the performance of a parallel system on a fixed-size problem. If S is the speedup on a
p-processor system, then f is defined as (1/S − 1/p)/(1 − 1/p). The value of f is exactly equal to the serial fraction
s of the algorithm if the loss in speedup is only due to the serial component (i.e., if the parallel program
has no other overheads). Smaller values of f are considered better. If f increases with the number
of processors, then it is considered an indicator of rising communication overhead, and thus an
indicator of poor scalability. If the value of f decreases with increasing p, then Karp and Flatt
[27] consider it to be an anomaly to be explained by phenomena such as superlinear speedup effects
or cache effects. On the contrary, our investigation shows that f can decrease for perfectly normal
programs. Assuming that the serial and the parallel algorithms are the same, f can be approximated
by To/(pTS). (Since S = TS/TP = pTS/(TS + To), for large p we have f ≈ 1/S − 1/p = To/(pTS).) For a fixed W
(and hence fixed TS), f will decrease provided To grows slower than Θ(p). This happens for some
fairly common algorithms such as parallel FFT on a SIMD hypercube [18] (the analysis in [18] can be
adapted for a SIMD hypercube by making the message startup time equal to zero). Also, parallel
algorithms for which f increases with p for a fixed W are not uncommon. For instance, for computing
vector dot products on a hypercube [19], To > Θ(p) and hence f increases with p if W is fixed. But
as shown in [19], this algorithm-architecture combination has an isoefficiency function of Θ(p log p)
and can be considered quite scalable.
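The experimentally determined serial fraction is easy to compute from measured timings. The sketch below (with purely hypothetical measurements) applies the Karp-Flatt formula f = (1/S − 1/p)/(1 − 1/p):

    # Karp-Flatt experimentally determined serial fraction (hypothetical timings).
    def karp_flatt(T_S, T_P, p):
        S = T_S / T_P
        return (1.0 / S - 1.0 / p) / (1.0 - 1.0 / p)

    T_S = 100.0
    measurements = {2: 52.0, 4: 27.0, 8: 14.5, 16: 8.0}   # p -> measured TP
    for p, T_P in sorted(measurements.items()):
        print(p, round(karp_flatt(T_S, T_P, p), 4))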
Zorbas et al. [61] developed the following framework to characterize the scalability of parallel systems.
A parallel algorithm with a serial component Wserial and a parallelizable component Wparallel,
when executed on one processor, takes tc(Wserial + Wparallel) time. Here tc is a positive constant. The
ideal execution time of the same algorithm on p processors should be tc(Wserial + Wparallel/p). However,
due to the overheads associated with parallelization, it always takes longer in practice. They next
introduce the concept of an overhead function Φ(p). A p-processor system scales with overhead Φ(p)
if the execution time TP on p processors satisfies TP ≤ tc(Wserial + Wparallel/p)Φ(p). The smallest
function Φ(p) that satisfies this inequality is called the system's overhead function and is defined by
Φ(p) = TP/(tc(Wserial + Wparallel/p)). A parallel system is considered ideally scalable if the overhead function remains
constant when the problem size is increased sufficiently fast w.r.t. p.
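In this framework the overhead function is simply the ratio of the measured parallel time to the ideal time. A minimal sketch follows; the symbol Φ used above is our reconstruction of a dropped glyph, and the numbers here are hypothetical:

    # Zorbas-style overhead function (illustrative; hypothetical inputs).
    def overhead_function(T_P, t_c, W_serial, W_parallel, p):
        ideal = t_c * (W_serial + W_parallel / p)
        return T_P / ideal          # >= 1; staying constant means ideally scalable

    print(overhead_function(T_P=2.6, t_c=1e-3, W_serial=100, W_parallel=150_000, p=64))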
For any parallel system, if the problem size grows at least as fast as the isoefficiency function,
then Φ(p) will be a constant, making the system (according to Zorbas' definition) ideally scalable.
Thus all parallel systems for which the isoefficiency function exists are scalable according to Zorbas'
definition. If Φ(p) grows as a function of p, then the rate of increase of the overhead function
determines the degree of unscalability of the system. The faster the increase, the more unscalable
the system is considered. Thus, in a way, this metric is complementary to the isoefficiency function.
It distinguishes scalable parallel systems from unscalable ones, but does not provide any information
on the degree of scalability of an ideally scalable system. On the other hand, the isoefficiency function
provides no information on the degree of unscalability of an unscalable parallel system. A limitation
of this metric is that Φ(p) captures overhead only due to communication, and not due to sequential
parts. Even if a parallel algorithm is poorly scalable due to large sequential components, Φ(p) can be
misleadingly low provided that the parallel system's communication overheads are low. For example,
if Wserial = W and Wparallel = 0, then Φ(p) = TP/(tc(Wserial + Wparallel/p)) = tcW/(tcW) = Θ(1) (i.e., the parallel system
is perfectly scalable!).
Chandran and Davis [8] defined the processor efficiency function (PEF) as the upper limit on the
number of processors p that can be used to solve a problem of input size N such that the execution
time on p processors is of the same order as the ratio of the sequential time to p; i.e., TP = Θ(W/p). An
inverse of this function, called the data efficiency function (DEF), is defined as the smallest problem
size on a given number of processors such that the above relation holds. The concept of the data efficiency
function is very similar to that of the isoefficiency function.
Kruskal et al. [28] defined the concept of Parallel Efficient (PE) problems, which is related to
the concept of the isoefficiency function. The PE class of problems has algorithms with a polynomial
isoefficiency function for some efficiency. The class PE makes an important delineation between
algorithms with polynomial isoefficiencies and those with even worse isoefficiencies. Kruskal et al.
proved the invariance of the class PE over a variety of parallel computation models and interconnection schemes. An important consequence of this result is that an algorithm with a polynomial
isoefficiency on one architecture will have a polynomial isoefficiency on many other architectures as
well. There can, however, be exceptions; for instance, it is shown in [18] that the FFT algorithm has
a polynomial isoefficiency on a hypercube but an exponential isoefficiency on a mesh. As shown in
[18, 48, 19], isoefficiency functions for a parallel algorithm can vary across architectures, and understanding this variation is of considerable practical importance. Thus, the concept of isoefficiency
helps us in further sub-classifying the problems in class PE. It helps us in identifying more scalable
algorithms for problems in the PE class by distinguishing between parallel systems whose isoefficiency
functions are small or large polynomials in p.
Eager, Zahorjan and Lazowska [9] introduced the average parallelism measure to characterize
the scalability of a parallel software system that consists of an acyclic directed graph of subtasks
with a possibility of precedence constraints among them. The average parallelism A is defined as
the average number of processors that are busy during the execution time of the parallel program,
provided an unlimited number of them are available. Once A is determined, analytically or experimentally, the speedup and efficiency on a p-processor system are lower bounded by pA/(p + A − 1) and
A/(p + A − 1), respectively. The reader should note that the average parallelism measure is useful only if
the parallel system incurs no communication overheads (or if these overheads can be ignored). It
is quite possible that a parallel algorithm with a large degree of concurrency is substantially worse
than one with smaller concurrency and minimal communication overheads. Speedup and efficiency
can be arbitrarily poor if communication overheads are present.
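Given the average parallelism A, the bounds quoted above are straightforward to evaluate; the sketch below tabulates them for a hypothetical value of A:

    # Lower bounds on speedup and efficiency from the average parallelism A
    # (Eager, Zahorjan and Lazowska); A here is a hypothetical value.
    def bounds_from_average_parallelism(A, p):
        speedup_lb = p * A / (p + A - 1.0)
        efficiency_lb = A / (p + A - 1.0)
        return speedup_lb, efficiency_lb

    A = 32.0
    for p in (8, 32, 128):
        print(p, bounds_from_average_parallelism(A, p))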
Marinescu and Rice [39] argue that a single parameter that depends solely on the nature of the
parallel software is not sufficient to analyze a parallel system. The reason is that the properties of
the parallel hardware can have a substantial e ect on the performance of a parallel system. As an
example, they point out the limitations of the approach of Eager et al. [9] in which a parallel system is
characterized with the average parallelism of the software as a parameter. Marinescu and Rice develop
a model to describe and analyze a parallel computation on a MIMD machine in terms of the number
of threads of control p into which the computation is divided and the number of communication acts
or events g(p) as a function of p. An event is defined as an act of communication or synchronization.
At a given time, a thread of control could either be actively performing the computation for which
the parallel computer is being used, or it could be communicating or blocked. The speedup of the
parallel computation can therefore be regarded as the average number of threads of control that are
active in useful computation at any time. The authors conclude that for a fixed problem size, if
g(p) = Θ(p), then the asymptotic speedup has a constant upper-bound. If g(p) = Θ(p^m) (m > 1),
then the speedup is maximized at a certain number of processors popt and asymptotically
approaches zero as the number of processors is increased further. For a given problem size, the value of popt
is dependent upon g(p). Hence g(p) provides a signature of a parallel computation, and popt can be
regarded as a measure of scalability of the computation. The higher the number of processors that
can be used optimally, the more scalable the parallel system is. The authors also discuss scaling the
problem size linearly with respect to p and derive similar results for the upper-bounds on attainable
speedup. When the number of events is a convex function of p, the number of threads of control
psmax that yields the maximum speedup can be derived from the equation p = (W + σg(p))/(σg′(p)), where W is
the problem size and σ is the work associated with each event. In their analysis, the authors regard
σ, the duration of each event, as a constant. For many parallel algorithms, it is a function of p and
W. Hence for these algorithms, the expression for psmax derived under the assumption of a constant
σ might yield a result quite different from the actual psmax.
Van-Catledge [53] developed a model to determine the relative performance of parallel computers
for an application w.r.t. the performance on selected supercomputer configurations for the same
application. The author presented a measure for evaluating the performance of a parallel computer
called the equal performance condition. It is defined as the number of processors needed in the
parallel computer to match its performance with that of a selected supercomputer. It is a function
of the serial fraction of the base problem, the scaling factor, and the base problem size on which the
scaling factor is applied. If the base problem size is W (in terms of the number of operations in the
parallel algorithm), and s is the serial fraction, then the number of serial and parallel operations
are sW and (1 − s)W, respectively. Assume that by scaling up the problem size by a factor k, the
number of serial and parallel operations becomes G(k)sW and F(k)(1 − s)W. If p′ is the number of
processors in the reference machine and t′c is its clock period (in terms of the time taken to perform
a unit operation of the algorithm), then the execution time of the problem on the reference machine
is T′P = t′c W (G(k)s + F(k)(1 − s)/p′). Similarly, on a test machine with clock period tc, using p processors
will result in an execution time of TP = tc W (G(k)s + F(k)(1 − s)/p). Now the relative performance of the
test machine w.r.t. the reference machine is given by T′P/TP. The value of p for which the relative
performance is one is used as a scalability measure. The fewer the number of processors that are
required to match the performance of the parallel system with that of the selected supercomputer,
the more scalable the parallel system is considered.
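The equal performance condition can be located numerically: given the reference machine's parameters and the scaling functions G and F, one searches for the smallest p at which the test machine is at least as fast. The sketch below is illustrative only, assuming for concreteness the special case G(k) = 1 and F(k) = k that is examined next, and hypothetical machine parameters:

    # Equal-performance condition (illustrative): smallest p for which a test
    # machine matches a reference machine, assuming G(k) = 1 and F(k) = k.
    def exec_time(t_clock, W, s, k, p, G=lambda k: 1.0, F=lambda k: k):
        return t_clock * W * (G(k) * s + F(k) * (1.0 - s) / p)

    def equal_performance_p(t_ref, p_ref, t_test, W, s, k, p_max=10**6):
        target = exec_time(t_ref, W, s, k, p_ref)
        for p in range(1, p_max + 1):
            if exec_time(t_test, W, s, k, p) <= target:
                return p
        return None   # the test machine cannot match the reference

    # Hypothetical: reference has 8 fast processors, test processors are 10x slower.
    print(equal_performance_p(t_ref=1.0, p_ref=8, t_test=10.0, W=1e6, s=0.01, k=4))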
Van-Catledge considers a case in detail for which G(k) = 1 and F(k) = k. For this case, he
computes the equal performance curves for a variety of combinations of s, k, and tc. For example, it
is shown that a uniprocessor system is better than a parallel computer having 1024 processors, each
of which is 100 times slower, unless s is less than 0.01 or the scaling factor is very large. In another
example, it is shown that the equal performance curve for a parallel system with 8 processors is
uniformly better than that of another system with 16 processors, each of which is half as fast. From
these results, it is inferred that it is better to have a parallel computer with fewer faster processors
than one with many slower processors. We discuss this issue further in Section 7 and show that this
inference is valid only for a certain class of parallel systems.
In [49, 21], Sun and Gustafson argue that traditional speedup is an unfair measure, as it favors
slow processors and poorly coded programs. They derive some fair performance measures for parallel
programs to correct this. The authors define sizeup as the ratio of the size of the problem solved
on the parallel computer to the size of the problem solved on the sequential computer in a given
fixed amount of time. Consider two parallel computers M1 and M2 for which the cost of each serial
operation is the same but M1 executes the parallelizable operations faster than M2. They show that
M1 will attain poorer speedups (even if the communication overheads are ignored), but according to
sizeup, M1 will be considered better. Based on the concept of fixed-time sizeup, Gustafson [21] has
developed the SLALOM benchmark for a distortion-free comparison of computers of widely varying
speeds.
Sun and Rover [51] define a scalability metric called the isospeed measure, which is the factor
by which the problem size has to be increased so that the average unit speed of computation remains
constant when the number of processors is raised from p to p′. The average unit speed of a parallel
computer is defined as its achieved speed (W/TP) divided by the number of processors. Thus, if the
number of processors is increased from p to p′, then isospeed(p, p′) = p′W/(pW′). The problem size W′
required for p′ processors is determined by the isospeed. For a perfectly parallelizable algorithm
with no communication, isospeed(p, p′) = 1 and W′ = p′W/p. For more realistic parallel systems,
isospeed(p, p′) < 1 and W′ > p′W/p.
The class PC* of problems defined by Vitter and Simons [54] captures a class of problems with
efficient parallel algorithms on a PRAM. Informally, a problem in class P (the polynomial time
class) is in PC* if it has a parallel algorithm on a PRAM that can use a polynomial (in terms of
input size) number of processors and achieve some minimal efficiency. Any problem in PC* has at
least one parallel algorithm such that for some efficiency E, its isoefficiency function exists and is a
polynomial. Between two efficient parallel algorithms in this class, obviously the one that is able to
use more processors is considered superior in terms of scalability.
Nussbaum and Agarwal [43] defined the scalability of a parallel architecture for a given algorithm
as the ratio of the algorithm's asymptotic speedup when run on the architecture in question to its
corresponding asymptotic speedup when run on an EREW PRAM. The asymptotic speedup is the
maximum obtainable speedup for a fixed problem size given an unlimited number of processors. This
metric captures the communication overheads of an architecture by comparing the performance of
a given algorithm on it to the performance of the same algorithm on an ideal communication-free
architecture. Informally, it reflects "the fraction of the parallelism inherent in a given algorithm that
can be exploited by any machine of that architecture as a function of problem size" [43]. Note that
this metric cannot be used to compare two algorithm-architecture pairs for solving the same problem
or to characterize the scalability (or unscalability) of a parallel algorithm. It can only be used to
compare parallel architectures for a given parallel algorithm. For any given parallel algorithm, its
optimum performance on a PRAM is fixed. Thus, comparing two architectures for the problem is
equivalent to comparing the minimum execution times for a given problem size on these architectures.
In this context, this work can be categorized with that of other researchers who have investigated
the minimum execution time as a function of the size of the problem on an architecture with an
unlimited number of processors available [58, 16, 11].
According to Nussbaum and Agarwal, the scalability of an algorithm is captured in its performance on the PRAM. Some combination of the two metrics (the asymptotic speedup of the algorithm
on the PRAM and the scalability of the parallel architecture) can be used as a combined metric
for the algorithm-architecture pair. A natural combination would be the product of the two, which
is equal to the maximum obtainable speedup (let us call it Smax(W)) of the parallel algorithm on the
architecture under consideration for a problem size W given arbitrarily many processors. In the best
case, Smax(W) can grow linearly with W. In the worst case, it could be a constant. The faster it
grows with respect to W, the more scalable the parallel system should be considered. This metric
can favor parallel systems for which the processor-time product is worse as long as they run faster.
For example, for the all-pairs shortest path problem, this metric will favor the parallel algorithm [1]
based upon an inefficient Θ(N³ log N) serial algorithm over the parallel algorithms [34, 25] that are
based on Floyd's Θ(N³) algorithm. The reason is that Smax(W) is Θ(N³) for the first algorithm
and Θ(N²) for the second one.
4 Performance Analysis of Large Scale Parallel Systems
Given a parallel system, the speedup on a problem of fixed size may peak at a certain limit because the
overheads may grow faster than the additional computation power offered by increasing the number
of processors. A number of researchers have analyzed the optimal number of processors required to
minimize parallel execution time for a variety of problems [11, 42, 52, 38].
Flatt and Kennedy [11, 10] derive some important upper-bounds related to the performance of
parallel computers in the presence of synchronization and communication overheads. They show that
if the overhead function satisfies certain properties, then there exists a unique value p0 of the number
of processors for which TP, for a given W, is minimum. However, for this value of p, the efficiency
of the parallel system is rather poor. Hence, they suggest that p should be chosen to maximize
the product of efficiency and speedup (which is equivalent to maximizing the ratio of efficiency to
parallel execution time), and they analytically compute the optimal values of p. (They later suggest
maximizing a combination of the number of processors and the parallel execution time, namely the
weighted geometric mean Fx of E and S; for a given problem size, Fx(p) = (E(p))^x (S(p))^(2−x), where
0 < x < 2.) A major assumption in their analysis is that the per-processor overhead to(W, p) (where
to(W, p) = To(W, p)/p) grows faster than Θ(p).
As discussed in [16], this assumption limits the range of the applicability of the analysis. Further, the
analysis in [16] reveals that the better a parallel algorithm is (i.e., the slower to grows with p), the
higher the value of p0. For many algorithm-architecture combinations, p0 exceeds the limit imposed
by the degree of concurrency on the number of processors that can be used [16]. Thus the theoretical
value of p0 and the efficiency at this point may not be useful in studying many parallel algorithms.
Flatt and Kennedy [11] also discuss scaling up the problem size with the number of processors.
They define scaled speedup as k times the ratio of TP(W, p) and TP(kW, kp). Under the simplifying
assumptions that to depends only on the number of processors (very often, apart from p, to is also a
function of W) and that the number of serial steps in the algorithm remains constant as the problem
size is increased, they prove that there is an upper-bound of kTP(W, p)/to(kp) on the scaled speedup.
They argue that since logarithmic per-processor overhead is the best possible, the upper-bound on
the scaled speedup is Θ(k/log(kp)). The reader should note that there are many important natural
parallel systems where the per-processor overhead to is smaller than Θ(log p). For instance, in two
of the three benchmark problems used in the Sandia experiment [22], the per-processor overhead was
constant.
Eager et al. [9] use the average parallelism measure to locate the position (in terms of the number
of processors) of the knee in the execution time-efficiency curve. The knee occurs at the same value
of p for which the ratio of efficiency to execution time is maximized. Determining the location of the
knee is important when the same parallel computer is being used for many applications, so that the
processors can be partitioned among the applications in an optimal fashion.
Tang and Li [52] prove that maximizing the efficiency to TP ratio is equivalent to minimizing
p(TP)². They propose minimization of a more general expression, p(TP)^r. Here r determines the
relative importance of efficiency and the parallel execution time. The choice of a higher r implies that
reducing TP is given more importance. Consequently, the system utilizes more processors, thereby
operating at a lower efficiency. A low value of r means greater emphasis on improving the efficiency
than on reducing TP; hence, the operating point of the system will use fewer processors and yield a
higher efficiency.
In [16], for a fairly general class of overhead functions, Gupta and Kumar analytically determine
the optimal number of processors to minimize TP as well as the more general metric p(TP)^r. It is then
shown that for a wide class of overhead functions, minimizing any of these metrics is asymptotically
equivalent to operating at a fixed efficiency that depends only on the overhead function of the parallel
system and the value of r.
As a refinement of the model of Gustafson et al., Zhou [60] and Van-Catledge [53] present models
that predict performance as a function of the serial fraction of the parallel algorithm. They conclude
that by increasing the problem size, one can obtain speedups arbitrarily close to the number of
processors. However, the rate at which the problem size needs to be scaled up in order to achieve this
depends on the serial component of the algorithm used to solve the problem. If the serial component
is a constant, then the problem size can be scaled up linearly with the number of processors in order
to get arbitrary speedups. In this case, isoefficiency analysis also suggests that a linear increase in
problem size w.r.t. p is sufficient to maintain any arbitrary efficiency. If the serial component grows
with the problem size, then the problem size may have to be increased faster than linearly w.r.t.
the number of processors. In this case, isoefficiency analysis not only suggests that a higher rate of
growth of W w.r.t. p will be required, but also provides an expression for this rate of growth, thus
answering the question posed in [60] as to how fast the problem size should grow to attain speedups
close to p.
The analysis in [60] and [53] does not take the communication overheads into account. The
serial component Wserial of an algorithm contributes Wserial(p − 1) ≈ pWserial to the total overhead
cost of parallel execution using p processors. The reason is that while one processor is busy on
the sequential part, the remaining (p − 1) are idle. Thus the serial component can model the
communication overhead as long as the overhead grows linearly with p. However, if the overhead
due to communication grows faster or slower than Θ(p), as is the case with many parallel systems,
the models based solely upon sequential vs. parallel components are not adequate.
Carmona and Rice [6, 7] provide new and improved interpretations for the parameters commonly
used in the literature, such as the serial fraction and the portion of time spent performing serial
work on a parallel system. The dynamics of parallel program characteristics like speedup and
efficiency are then presented as functions of these parameters for several theoretical and empirical
examples.
5 Relation Between Some Scalability Measures
After reviewing these various measures of scalability, one may ask whether there exists one measure
that is better than all others [23]. The answer to this question is no, as different measures are
suitable for different situations.
One situation arises when the problem at hand is fixed and one is trying to use an increasing
number of processors to solve it. In this case, the speedup is determined by the serial fraction in the
program as well as other overheads, such as those due to communication and redundant work.
In this situation, choosing one parallel system over another can be done using the standard speedup
metric. Note that for any fixed problem size W, the speedup on a parallel system will saturate or
peak at some value Smax(W), which can also be used as a metric. Scalability issues for the fixed
problem size case are addressed in [11, 27, 16, 42, 52, 58].
Another possible scenario is that in which a parallel computer with a fixed number of processors
is being used and the best parallel algorithm needs to be chosen for solving a particular problem. For
a fixed p, the efficiency increases as the problem size is increased. The rate at which the efficiency
increases and approaches one (or some other maximum value) with respect to increase in problem size
may be used to characterize the quality of the algorithm's implementation on the given architecture.
The third situation arises when the additional computing power due to the use of more processors
is to be used to solve bigger problems. Now the question is how should the problem size be increased
with the number of processors?
For many problem domains, it is appropriate to increase the problem size with the number of
processors so that the total parallel execution time remains fixed. An example is the domain of
weather forecasting. In this domain, the size of the problem can be increased arbitrarily provided
that the problem can be solved within a specified time (e.g., it does not make sense to take more
than 24 hours to forecast the next day's weather). The scalability issues for such problems have been
explored by Worley [58], Gustafson [22, 20], and Sun and Ni [50].
Another extreme in scaling the problem size is to solve the biggest problems that can be accommodated
in the memory. This is investigated by Worley [57, 58, 59], Gustafson [22, 20] and by Sun and Ni [50], and
is called the memory-constrained case. Since the total memory of a parallel computer increases with
increasing p, it is possible to solve bigger problems on a parallel computer with larger p. It should
also be clear that any problem size for which the memory requirement exceeds the total available
memory cannot be solved on the system.
An important scenario is that in which one is interested in making efficient use of the parallel
system; i.e., it is desired that the overall performance of the parallel system increases linearly with
p. Clearly, this can be done only for scalable parallel systems, which are exactly the ones for which a
fixed efficiency can be maintained for arbitrarily large p by simply increasing the problem size. For
such systems, it is natural to use the isoefficiency function or related metrics [31, 28, 8]. The analyses in
[60, 61, 11, 38, 42, 52, 9] also attempt to study the behavior of a parallel system with some concern
for overall efficiency.
Although different scalability measures are appropriate for rather different situations, many of
them are related to each other. For example, from the isoefficiency analysis, one can reach a number
of conclusions regarding the time-constrained case (i.e., when bigger problems are solved on larger
parallel computers with some upper-bound on the parallel execution time). It can be shown that for
cost-optimal algorithms, the problem size can be increased linearly with the number of processors
while maintaining a fixed execution time if and only if the isoefficiency function is Θ(p). The proof
is as follows. Let C(W) be the degree of concurrency of the algorithm. Thus, as p is increased, W
has to be increased at least as Θ(p), or else p will eventually exceed C(W). Note that C(W) is
upper-bounded by Θ(W) and p is upper-bounded by C(W). TP is given by TS/p + To(W, p)/p = tcW/p + To(W, p)/p.
Now consider the following two cases. Let the first case be when C(W) is smaller than Θ(W).
In this case, even if as many as C(W) processors are used, the term tcW/C(W) of the expression for TP will
diverge with increasing W, and hence, it is not possible to continue to increase the problem size and
maintain a fixed parallel execution time. At the same time, the overall isoefficiency function grows
faster than Θ(p) because the isoefficiency due to concurrency exceeds Θ(p). In the second case, in
which C(W) = Θ(W), as many as Θ(W) processors can be used. If Θ(W) processors are used, then
the first term in TP can be maintained at a constant value irrespective of W. The second term in
TP will remain constant if and only if To(W, p)/p remains constant when p = Θ(W) (in other words,
To(W, p)/p remains constant while p and W are of the same order). This condition is necessary and
sufficient for a linear isoefficiency function.
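The argument above can be checked numerically: under linear scaling W = c·p, the parallel execution time tcW/p + To(W, p)/p stays constant exactly when To(W, p)/p stays constant for p = Θ(W). The sketch below uses two assumed (hypothetical) overhead models to contrast a system with Θ(p) isoefficiency and one with Θ(p log p) isoefficiency:

    # Time-constrained scaling check (illustrative models, hypothetical constants).
    import math

    t_c = 1.0

    def T_P(W, p, T_o):
        return t_c * W / p + T_o(W, p) / p

    overhead_linear = lambda W, p: 10.0 * p                  # isoefficiency Theta(p)
    overhead_plogp  = lambda W, p: 10.0 * p * math.log2(p)   # isoefficiency Theta(p log p)

    for p in (2, 16, 128, 1024):
        W = 1000 * p                                         # linear scaling W = c*p
        print(p, round(T_P(W, p, overhead_linear), 2), round(T_P(W, p, overhead_plogp), 2))

With the Θ(p) overhead the execution time stays fixed under linear scaling, while with the Θ(p log p) overhead it grows, as the argument above predicts.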
A direct corollary of the above result is that if the isoefficiency function is greater than Θ(p),
then the minimum parallel execution time will increase even if the problem size is increased as slowly
as linearly with the number of processors. Worley [57, 58, 59] has shown that for many algorithms
used in the scientific domain, for any given TP, there will exist a problem size large enough that
it cannot be solved in time TP, no matter how many processors are used. Our above analysis shows
that for these parallel systems, the isoefficiency curves have to be worse than linear. It can be
easily shown that the isoefficiency function will be greater than Θ(p) for any algorithm-architecture
combination for which To > Θ(p) for a given W. The latter is true when any algorithm with a global
operation (such as broadcast, and one-to-all and all-to-all personalized communication [5, 26]) is
implemented on a parallel architecture that has a message passing latency or message startup time.
Thus, it can be concluded that for any cost-optimal parallel algorithm involving global communication,
the problem size cannot be increased indefinitely without increasing the execution time on a parallel
computer having a startup latency for messages, no matter how many processors are used (up to a
maximum of W). This class of algorithms includes some fairly important algorithms such as matrix
multiplication (all-to-all/one-to-all broadcast) [15], vector dot products (single node accumulation)
[19], shortest paths (one-to-all broadcast) [34], and FFT (all-to-all personalized communication) [18].
The reader should note that the presence of a global communication operation in an algorithm is
a sufficient but not a necessary condition for non-linear isoefficiency on an architecture with message
passing latency. Thus, the class of algorithms having the above mentioned property is not limited
to algorithms with global communication.
If the isoefficiency function of a parallel system is greater than Θ(p), then given a problem size
W, there is a lower-bound on the parallel execution time. This lower-bound (let us call it TPmin(W))
is a non-decreasing function of W. The rate at which TPmin(W) for a problem (given arbitrarily
many processors) must increase with the problem size can also serve as a measure of scalability of the
parallel system. In the best case, TPmin(W) is constant; i.e., larger problems can be solved in a fixed
amount of time by simply increasing the number of processors. In the worst case, TPmin(W) = Θ(W).
This happens when the degree of effective parallelism is constant. The slower TPmin(W) grows as
a function of the problem size, the more scalable the parallel system is. TPmin is closely related
to Smax(W) defined in the context of Nussbaum and Agarwal's work in Section 3. For a problem
size W, these two metrics are related by W = Smax(W) · TPmin(W). Let p0(W) be the number of
processors that should be used for obtaining the minimum parallel execution time TPmin(W) for a
problem of size W. Clearly, TPmin(W) = W/p0(W) + To(W, p0(W))/p0(W). Using p0(W) processors leads to the optimal
parallel execution time TPmin(W), but may not lead to the minimum pTP product (or the cost of parallel
execution). Now consider the cost-optimal implementation of the parallel system (i.e., when the
number of processors used for a given problem size is governed by the isoefficiency function). In this
case, if f(p) is the isoefficiency function, then TP is given by W/f⁻¹(W) + To(W, f⁻¹(W))/f⁻¹(W) for a fixed W.
Let us call this TPiso(W). Clearly, TPiso(W) can be no better than TPmin(W). In [16], it is shown that
for a fairly wide class of overhead functions, the relation between the problem size W and the number of
processors p at which the parallel run time TP is minimized is given by the isoefficiency function for
some value of efficiency. Hence, TPiso(W) and TPmin(W) have the same asymptotic growth rate w.r.t.
W for these types of overhead functions. For these parallel systems, no advantage (in terms of
asymptotic parallel execution time) is gained by making inefficient use of the parallel processors.
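Both TPmin(W) and TPiso(W) can be computed by brute force for a concrete overhead model. The sketch below is illustrative only, with an assumed overhead To(W, p) = a·p·log2(p) and hypothetical constants; it minimizes TP over p to obtain TPmin(W) and evaluates TP at the largest processor count that still sustains a target efficiency (i.e., the count dictated by the isoefficiency relation):

    # TPmin(W) vs. TPiso(W) for an assumed overhead To(W, p) = a*p*log2(p).
    import math

    t_c, a = 1.0, 5.0

    def T_P(W, p):
        return (t_c * W + a * p * math.log2(p)) / p

    def T_P_min(W):
        return min(T_P(W, p) for p in range(2, int(W) + 1))

    def T_P_iso(W, E=0.5):
        best = 2
        for p in range(2, int(W) + 1):       # largest p still sustaining efficiency E
            if (t_c * W) / (p * T_P(W, p)) >= E:
                best = p
        return T_P(W, best)

    for W in (1_000, 10_000, 100_000):
        print(W, round(T_P_min(W), 2), round(T_P_iso(W), 2))

For this overhead both quantities grow as Θ(log W), illustrating the claim that cost-optimal, isoefficiency-governed operation loses nothing asymptotically.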
Several researchers have proposed to use an operating point where the value of p(TP)^r is minimized
for some constant r and for a given problem size W [11, 9, 52]. It can be shown [52] that this
corresponds to the point where ES^(r−1) is maximized for a given problem size. Note that the location
of the minima of p(TP)^r (with respect to p) for two different algorithm-architecture combinations
can be used to choose between the two. In [16], it is also shown that for a fairly large class
of overhead functions, the relation between the problem size and the p for which the expression p(TP)^r
is minimum for that problem size is given by the isoefficiency function (to maintain a particular
efficiency E, which is a function of r) for the algorithm and the architecture being used.
6 Impact of Technology Dependent Factors on Scalability
A number of researchers have considered the impact of the changes in CPU and communication speeds
on the performance of a parallel system. It is clear that if the CPU or the communication speed is
increased, the overall performance can only become better. But unlike a sequential computer, in a
parallel system, a k-fold increase in the CPU speed may not necessarily result in a k-fold reduction
in execution time for the same problem size.
For instance, consider the implementation of the FFT algorithm on a SIMD hypercube [18]. The
efficiency of an N-point FFT computation on a p-processor hypercube is given by E = 1/(1 + (tw log p)/(tc log N)). Here tw
is the time required to transfer one word of data between two directly connected processors and tc is
the time required to perform a unit FFT computation. In order to maintain an efficiency of E, the
isoefficiency function of this parallel system is given by (E/(1 − E))(tw/tc) · p^(E tw/((1−E) tc)) · log p. Clearly, the scalability
of the system deteriorates very rapidly with an increase in the value of (E/(1 − E))(tw/tc). For a given E,
this can happen either if the computation speed of the processors is increased (i.e., tc is reduced)
or the communication speed is decreased (i.e., tw is increased). For example, if tw = tc = 1, the
isoefficiency function for maintaining an efficiency of 0.5 is p log p. If p is increased 10 times, then
the problem size needs to grow somewhat more than 10 times to maintain the same efficiency. On the other hand,
a 10-fold increase in tw or a 10-fold decrease in tc will require one to solve a problem of size nearly
equal to W^10 if W was the problem size required to get the same efficiency on the original machine.
Thus, technology dependent constants like tc and tw can have a dramatic impact on the scalability
of a parallel system.
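The sensitivity of the FFT example to the ratio tw/tc is easy to see numerically. The sketch below simply evaluates the efficiency expression given above for a few machine parameter choices (the values of N, p, tc, and tw are hypothetical):

    # Efficiency of the N-point FFT on a p-processor SIMD hypercube:
    # E = 1 / (1 + (tw/tc) * log2(p) / log2(N))   (hypothetical tw, tc values).
    import math

    def fft_efficiency(N, p, t_c, t_w):
        return 1.0 / (1.0 + (t_w / t_c) * math.log2(p) / math.log2(N))

    N, p = 2**20, 1024
    for t_c, t_w in ((1.0, 1.0), (0.1, 1.0), (1.0, 10.0)):
        print(t_c, t_w, round(fft_efficiency(N, p, t_c, t_w), 3))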
In [48] and [15], the impact of these parameters on the scalability of parallel shortest path
algorithms and matrix multiplication algorithms, respectively, is discussed. In these algorithms, the
isoefficiency function does not grow exponentially with respect to the tw/tc ratio (as in the case of FFT), but
is a polynomial function of this ratio. For example, in the best-scalable parallel implementations of
the matrix multiplication algorithm on the mesh [15], the isoefficiency function is proportional to (tw/tc)³.
The implication of this is that for the same communication speed, using ten times faster processors
in the parallel computer will require a 1000-fold increase in the problem size to maintain the same
efficiency. On the other hand, if p is increased by a factor of 10, then the same efficiency can be
obtained on 10√10 times bigger problems. Hence, for parallel matrix multiplication, it is better to
have a parallel computer with k-fold as many processors rather than one with the same number of
processors, each k-fold as fast (assuming that the communication network and the bandwidth etc.
remain the same).
Nicol and Willard [42] consider the impact of CPU and communication speeds on the peak
performance of parallel systems. They study the performance of various parallel architectures for
solving elliptic partial differential equations. They show that for this application, a k-fold increase
in the speed of the CPUs results in an improvement in the optimal execution time by a factor of k^(1/3)
on a bus architecture if the communication speed is not improved. They also show that improving
only the bus speed by a factor of k reduces the optimal execution time by a factor of k^(2/3).
Kung [35] studied the impact of increasing CPU speed (while keeping the I/O speed fixed) on
the memory requirement for a variety of problems. It is argued that if the CPU speed is increased
without a corresponding increase in the I/O bandwidth, the system will become imbalanced, leading
to idle time during computations. An increase in the CPU speed by a factor of α w.r.t. the I/O speed
requires a corresponding increase in the size of the memory of the system by a factor of H(α) in
order to keep the system balanced. As an example, it has been shown that for matrix multiplication,
H(α) = α² and for computations such as relaxation on a d-dimensional grid, H(α) = α^d. The reader
should note that similar conclusions can be shown to hold for a variety of parallel computers if only
the CPU speed of the processors is increased without changing the bandwidth of the communication
channels.
The above examples show that improving the technology in only one direction (i.e., towards
faster CPUs) may not be a wise idea. The overall execution time will, of course, be reduced by using
faster processors, but the speedup with respect to the execution time on a single fast processor
will decrease. In other words, increasing the speed of the processors alone, without improving the
communication speed, will result in diminishing returns in terms of overall speedup and efficiency.
The decrease in performance is highly problem dependent. For some problems it is dramatic (e.g.,
parallel FFT [18]), while for others (e.g., parallel matrix multiplication) it is less severe.
7 Study of Cost Effectiveness of Parallel Systems
So far we have con ned the study of scalability of a parallel system to investigating the capability
of the parallel algorithm to e ectively utilize an increasing number of processors on the parallel
architecture. Many algorithms may be more scalable on more costly architectures (i.e., architectures
that are cost-wise less scalable due to high cost of expanding the architecture by each additional
processor). For example, the cost of a parallel computer with p processors organized in a mesh
con guration is proportional to 4p, whereas a hypercube con guration of the same size will have a
cost proportional to p log p because a p processor mesh has 4p communication links while a p processor
hypercube has p log p communication links. It is assumed that the cost is proportional to the number
of links and is independent of the length of the links. In such situations, one needs to consider
whether it is better to have a larger parallel computer of a cost-wise more scalable architecture that
is underutilized (because of poor efficiency), or to have a smaller parallel computer of
a cost-wise less scalable architecture that is better utilized. For a given amount of resources, the
aim is to maximize the overall performance, which is proportional to the product of the number of
processors and the efficiency obtained on them. It is shown in [18] that under certain assumptions, it
is more cost-effective to implement the FFT algorithm on a hypercube rather than on a mesh, despite the fact that
large scale meshes are cheaper to construct than large hypercubes. On the other hand, it is quite
possible that the implementation of a parallel algorithm is more cost-effective on an architecture
on which the algorithm is less scalable. Hence this type of cost analysis can be very valuable in
determining the most suitable architecture for a class of applications.
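As a rough sketch of this tradeoff under the link-count cost model above, the following hypothetical calculation shows how many processors a fixed link budget buys for each topology; comparing the two machines further requires the (problem-dependent) efficiency each one achieves, which must come from the scalability analysis of the algorithm in question (e.g., FFT in [18]):

    import math

    def mesh_processors(budget):
        # mesh cost taken proportional to 4p links (constant of proportionality 1)
        return int(budget // 4)

    def hypercube_processors(budget):
        # hypercube cost taken proportional to p log2(p) links;
        # use the largest power of two whose cost fits in the budget
        p = 1
        while (2 * p) * math.log2(2 * p) <= budget:
            p *= 2
        return p

    budget = 4096
    p_mesh = mesh_processors(budget)        # 1024 processors
    p_cube = hypercube_processors(budget)   # 256 processors
    # the better machine is the one with the larger p * E(p), where E(p) is
    # the efficiency of the algorithm on that topology with p processors
    print(p_mesh, p_cube)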
Another issue in the cost vs. performance analysis of parallel systems is to determine the tradeoffs
between the speed of processors and the number of processors to be employed for a given budget.
From the analysis in [3] and [53], it may appear that higher performance is always obtained by using
fewer and faster processors. It can be shown that this is true only for those parallel systems in which
the overall overhead To grows faster than or equal to Θ(p). The model considered in [53] is one
for which To = Θ(p), as they consider the case of a constant serial component. As an example of a
parallel system where their conclusion is not applicable, consider matrix multiplication on a SIMD
mesh architecture [1]. Here To = tw N^2 sqrt(p) for multiplying N x N matrices. Thus TP = tc N^3/p + tw N^2/sqrt(p).
Clearly, a better speedup can be obtained by increasing p by a factor of k (until p reaches N^2) rather
than reducing tc by the same factor.
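The claim can be checked numerically from the expression for TP above. In the following sketch the values of N, p, tc, tw, and k are arbitrary and purely illustrative:

    from math import sqrt

    def t_parallel(N, p, tc, tw):
        # TP = tc*N^3/p + tw*N^2/sqrt(p) for N x N matrix multiplication on a mesh
        return tc * N**3 / p + tw * N**2 / sqrt(p)

    N, p, tc, tw, k = 512, 256, 1.0, 3.0, 4            # arbitrary illustrative values
    base        = t_parallel(N, p,     tc,     tw)
    more_procs  = t_parallel(N, k * p, tc,     tw)     # k-fold more processors
    faster_cpus = t_parallel(N, p,     tc / k, tw)     # k-fold faster processors
    print(base, more_procs, faster_cpus)               # more_procs < faster_cpus

Increasing p reduces both the computation and the communication terms, whereas a faster processor reduces only the computation term, which is why the former wins here.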
Even for the case in which To grows faster than or equal to Θ(p), it might not always be cost-effective
to have fewer faster processors. The reason is that the cost of a processor increases very
rapidly with speed once a technology-dependent threshold is reached. For example, the cost
of a processor with a 2 ns clock cycle is far more than 10 times the cost of a processor with a 20 ns clock
cycle, given today's technology. Hence, it may be more cost-effective to get higher performance by
having a large number of relatively less utilized processors than to have a small number of processors
whose speed is beyond the current technology threshold.
Some analysis of the cost-effectiveness of parallel computers is given by Barton and Withers in
[3]. They define the cost C of an ensemble of p processors to be equal to dpV^b, where V is the speed of
each individual processor in FLOPS (floating point operations per second), and d and b are constants,
b typically being greater than 1. Now for a given fixed cost and a specific problem size, the actual
delivered speed in FLOPS is given by

    K p^r / (1 + (p - 1)f + K p^r tc(p)/W),

where f is the serial fraction, K = (C/d)^(1/b) is a constant, W is the problem size in terms of the
number of floating point operations, r = (b - 1)/b, and tc(p) denotes the time spent in interprocessor
communication, etc. From this expression, it can be shown
that for a given cost and a fixed problem instance, the delivered FLOPS peaks for a certain number
of processors.
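A direct way to observe the peak is to evaluate the delivered-FLOPS expression above over a range of p for a fixed cost. All parameter values and the communication-time model tc(p) in the sketch below are hypothetical:

    def delivered_flops(p, C, d, b, f, W, tc):
        # K*p^r / (1 + (p-1)*f + K*p^r*tc(p)/W), with K = (C/d)^(1/b), r = (b-1)/b
        K = (C / d) ** (1.0 / b)
        r = (b - 1.0) / b
        return K * p**r / (1.0 + (p - 1) * f + K * p**r * tc(p) / W)

    # hypothetical fixed cost, cost exponent b > 1, serial fraction f, and
    # problem size W; tc(p) is an assumed communication-time model
    C, d, b, f, W = 1.0e6, 1.0, 1.5, 0.001, 1.0e9
    tc = lambda p: 1.0e-4 * p
    best_p = max(range(1, 10001), key=lambda p: delivered_flops(p, C, d, b, f, W, tc))
    print(best_p, delivered_flops(best_p, C, d, b, f, W, tc))

With these (arbitrary) values the delivered FLOPS rises and then falls, peaking at a finite number of processors, as claimed.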
As noted in [3], this analysis is valid only for a fixed problem size. Actually, the value of p
for peak performance also increases as the problem size is increased. Thus, if the problem is scaled
properly, it could still be beneficial to use more and more processors rather than opting for fewer
faster processors. The relation between the problem size and the number of processors at which the
peak performance is obtained for a given C could also serve as a measure of scalability of the parallel
system under the fixed-cost paradigm. For a fixed cost, the slower the rate at which the problem
size has to grow with the number of processors that yields maximum throughput, the
more scalable the algorithm-architecture combination should be considered.
The study of cost related issues of a parallel system is quite complicated because the cost of a
parallel computer depends on many different factors. Typically, it depends on the number of processors
used, the configuration in which they are connected (this determines the number and the length
of communication links), the speed of the processors, and the speed of the communication channels, and
usually it is a complex non-linear function of all these factors. For a given cost, its optimal distribution
among the different components (e.g., processors, memory, cache, communication network, etc.)
of the parallel computer depends on the computation and communication pattern of the application
and the size of the problem to be solved.
While studying the performance of a parallel system, its efficiency E is often considered as an
important attribute. Obtaining or maintaining a good efficiency as the parallel system is scaled up
is an important concern of the implementors of the parallel system, and often the ease with which
this objective can be met determines the degree of scalability of the parallel system. In fact, the
term commonly known as efficiency is a loose term; it should more precisely be called processor-efficiency,
because it is a measure of the efficiency with which the processors in the parallel computer
are being utilized. Usually a parallel computer using p processors is not p times as costly as a
sequential computer with a similar processor. Therefore, processor-efficiency is not a measure of the
efficiency with which the money invested in building a parallel computer is being utilized. If using
p processors results in a speedup of S over a single processor, then S/p gives the processor-efficiency
of the parallel system. Analogously, if an investment of C(p) dollars yields a speedup of S, then,
assuming unit cost for a single-processor system, the ratio S/C(p) should determine the efficiency with
which the cost of building a parallel computer with p processors is being utilized. We can define S/C(p)
as the cost-efficiency EC of the parallel system, which, from a practical point of view, might be a
more useful and insightful measure for comparing two parallel systems.
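The distinction can be made concrete with a small sketch. The speedup, processor count, and cost function C(p) below are purely illustrative (with unit cost assumed for a single-processor system, as above):

    def processor_efficiency(S, p):
        return S / p           # E = S/p

    def cost_efficiency(S, cost):
        return S / cost        # EC = S/C(p), with C(1) = 1

    p, S = 64, 40.0            # hypothetical machine and speedup
    C_p = 0.75 * p             # hypothetical cost model: each node costs 0.75 units
    print(processor_efficiency(S, p), cost_efficiency(S, C_p))   # 0.625 vs about 0.83

In this hypothetical case the machine looks mediocre by processor-efficiency but quite good by cost-efficiency, because each node costs less than a full single-processor system.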
8 Concluding Remarks
Significant progress has been made in identifying and developing measures of scalability in the last
few years. These measures can be used to analyze whether parallel processing can offer desired
performance improvement for problems at hand. They can also help guide the design of large-scale
parallel computers. It seems clear that no single scalability metric would be better than all others.
Different measures will be useful in different contexts, and further research is needed along several
directions. Nevertheless, a number of interesting properties of several metrics and parallel systems
have been identified. For example, we show that a cost-optimal parallel algorithm can be used
to solve arbitrarily large problem instances in a fixed time if and only if its isoefficiency function is linear.
Arbitrarily large instances of problems involving global communication cannot be solved in constant
time on a parallel machine with message-passing latency. To make efficient use of processors while
solving a problem on a parallel computer, it is necessary that the number of processors be governed
by the isoefficiency function. If more processors are used, the problem might be solved faster, but
the parallel system will not be utilized efficiently. For a wide class of parallel systems identified in
[16], using more processors than determined by the isoefficiency function does not help in reducing
the parallel time complexity of the algorithm.
The introduction of hardware cost factors (in addition to speedup and efficiency) into the scalability
analysis is important so that the overall cost-effectiveness can be determined. The current work in this
direction is still very preliminary. Another problem that needs much attention is that of partitioning
processor resources among different problems. If two problems are to be solved on a multicomputer
and one of them is poorly scalable while the other one has very good scalability characteristics on
the given machine, then clearly, allocating more processors to the more scalable problem tends to
optimize the efficiency, but not the overall execution time. The problem of partitioning becomes
more complex as the number of different computations is increased. Some early work on this topic
has been reported in [4, 44, 37, 38].
References
[1] S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1989.
[2] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings, pages 483–485, 1967.
[3] M. L. Barton and G. R. Withers. Computing performance as a function of the speed, quantity, and the cost of processors. In Supercomputing '89 Proceedings, pages 759–764, 1989.
[4] Krishna P. Belkhale and Prithviraj Banerjee. Approximate algorithms for the partitionable independent task scheduling problem. In Proceedings of the 1990 International Conference on Parallel Processing, pages I72–I75, 1990.
[5] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ, 1989.
[6] E. A. Carmona and M. D. Rice. A model of parallel performance. Technical Report AFWL-TR-89-01, Air Force Weapons Laboratory, 1989.
[7] E. A. Carmona and M. D. Rice. Modeling the serial and parallel fractions of a parallel algorithm. Journal of Parallel and Distributed Computing, 1991.
[8] S. Chandran and Larry S. Davis. An approach to parallel vision algorithms. In R. Porth, editor, Parallel Processing. SIAM, Philadelphia, PA, 1987.
[9] D. L. Eager, J. Zahorjan, and E. D. Lazowska. Speedup versus efficiency in parallel systems. IEEE Transactions on Computers, 38(3):408–423, 1989.
[10] Horace P. Flatt. Further applications of the overhead model for parallel systems. Technical Report G320-3540, IBM Corporation, Palo Alto Scientific Center, Palo Alto, CA, 1990.
[11] Horace P. Flatt and Ken Kennedy. Performance of parallel processors. Parallel Computing, 12:1–20, 1989.
[12] G. C. Fox, M. Johnson, G. Lyzenga, S. W. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors: Volume 1. Prentice-Hall, Englewood Cliffs, NJ, 1988.
[13] Ananth Grama, Anshul Gupta, and Vipin Kumar. Isoefficiency: Measuring the scalability of parallel algorithms and architectures. IEEE Parallel and Distributed Technology, 1(3):12–21, August 1993. Also available as Technical Report TR 93-24, Department of Computer Science, University of Minnesota, Minneapolis, MN.
[14] Ananth Grama, Vipin Kumar, and V. Nageshwara Rao. Experimental evaluation of load balancing techniques for the hypercube. In Proceedings of the Parallel Computing '91 Conference, pages 497–514, 1991.
[15] Anshul Gupta and Vipin Kumar. The scalability of matrix multiplication algorithms on parallel computers. Technical Report TR 91-54, Department of Computer Science, University of Minnesota, Minneapolis, MN, 1991. A short version appears in Proceedings of the 1993 International Conference on Parallel Processing, pages III-115–III-119, 1993.
[16] Anshul Gupta and Vipin Kumar. Performance properties of large scale parallel systems. Journal of Parallel and Distributed Computing, 19:234–244, 1993. Also available as Technical Report TR 92-32, Department of Computer Science, University of Minnesota, Minneapolis, MN.
[17] Anshul Gupta and Vipin Kumar. A scalable parallel algorithm for sparse matrix factorization. Technical Report 94-19, Department of Computer Science, University of Minnesota, Minneapolis, MN, 1994. A short version appears in Supercomputing '94 Proceedings. TR available in users/kumar at anonymous FTP site ftp.cs.umn.edu.
[18] Anshul Gupta and Vipin Kumar. The scalability of FFT on parallel computers. IEEE Transactions on Parallel and Distributed Systems, 4(8):922–932, August 1993. A detailed version is available as Technical Report TR 90-53, Department of Computer Science, University of Minnesota, Minneapolis, MN.
[19] Anshul Gupta, Vipin Kumar, and A. H. Sameh. Performance and scalability of preconditioned conjugate gradient methods on parallel computers. IEEE Transactions on Parallel and Distributed Systems, 1995 (to appear). Also available as Technical Report TR 92-64, Department of Computer Science, University of Minnesota, Minneapolis, MN. A short version appears in Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, pages 664–674, 1993.
[20] John L. Gustafson. Reevaluating Amdahl's law. Communications of the ACM, 31(5):532–533, 1988.
[21] John L. Gustafson. The consequences of fixed time performance measurement. In Proceedings of the 25th Hawaii International Conference on System Sciences: Volume III, pages 113–124, 1992.
[22] John L. Gustafson, Gary R. Montry, and Robert E. Benner. Development of parallel methods for a 1024-processor hypercube. SIAM Journal on Scientific and Statistical Computing, 9(4):609–638, 1988.
[23] Mark D. Hill. What is scalability? Computer Architecture News, 18(4), 1990.
[24] Kai Hwang. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, New York, NY, 1993.
[25] Jing-Fu Jenq and Sartaj Sahni. All pairs shortest paths on a hypercube multiprocessor. In Proceedings of the 1987 International Conference on Parallel Processing, pages 713–716, 1987.
[26] S. L. Johnsson and C.-T. Ho. Optimum broadcasting and personalized communication in hypercubes. IEEE Transactions on Computers, 38(9):1249–1268, September 1989.
[27] Alan H. Karp and Horace P. Flatt. Measuring parallel processor performance. Communications of the ACM, 33(5):539–543, 1990.
[28] Clyde P. Kruskal, Larry Rudolph, and Marc Snir. A complexity theory of efficient parallel algorithms. Technical Report RC13572, IBM T. J. Watson Research Center, Yorktown Heights, NY, 1988.
[29] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings, Redwood City, CA, 1994.
[30] Vipin Kumar, Ananth Grama, and V. Nageshwara Rao. Scalable load balancing techniques for parallel computers. Technical Report 91-55, Computer Science Department, University of Minnesota, 1991. To appear in Journal of Parallel and Distributed Computing, 1994.
[31] Vipin Kumar and V. N. Rao. Parallel depth-first search, part II: Analysis. International Journal of Parallel Programming, 16(6):501–519, 1987.
[32] Vipin Kumar and V. N. Rao. Load balancing on the hypercube architecture. In Proceedings of the Fourth Conference on Hypercubes, Concurrent Computers, and Applications, pages 603–608, 1989.
[33] Vipin Kumar and V. N. Rao. Scalable parallel formulations of depth-first search. In Vipin Kumar, P. S. Gopalakrishnan, and Laveen N. Kanal, editors, Parallel Algorithms for Machine Intelligence and Vision. Springer-Verlag, New York, NY, 1990.
[34] Vipin Kumar and Vineet Singh. Scalability of parallel algorithms for the all-pairs shortest path problem. Journal of Parallel and Distributed Computing, 13(2):124–138, October 1991. A short version appears in the Proceedings of the International Conference on Parallel Processing, 1990.
[35] H. T. Kung. Memory requirements for balanced computer architectures. In Proceedings of the 1986 IEEE Symposium on Computer Architecture, pages 49–54, 1986.
[36] J. Lee, E. Shragowitz, and S. Sahni. A hypercube algorithm for the 0/1 knapsack problem. In Proceedings of the 1987 International Conference on Parallel Processing, pages 699–706, 1987.
[37] Michael R. Leuze, Lawrence W. Dowdy, and Kee Hyun Park. Multiprogramming a distributed-memory multiprocessor. Concurrency: Practice and Experience, 1(1):19–33, September 1989.
[38] Y. W. E. Ma and Denis G. Shea. Downward scalability of parallel architectures. In Proceedings of the 1988 International Conference on Supercomputing, pages 109–120, 1988.
[39] Dan C. Marinescu and John R. Rice. On high level characterization of parallelism. Technical Report CSD-TR-1011, CAPO Report CER-90-32, Computer Science Department, Purdue University, West Lafayette, IN, revised June 1991. To appear in Journal of Parallel and Distributed Computing, 1993.
[40] Paul Messina. Emerging supercomputer architectures. Technical Report C3P 746, Concurrent Computation Program, California Institute of Technology, Pasadena, CA, 1987.
[41] Cleve Moler. Another look at Amdahl's law. Technical Report TN-02-0587-0288, Intel Scientific Computers, 1987.
[42] David M. Nicol and Frank H. Willard. Problem size, parallel architecture, and optimal speedup. Journal of Parallel and Distributed Computing, 5:404–420, 1988.
[43] Daniel Nussbaum and Anant Agarwal. Scalability of parallel machines. Communications of the ACM, 34(3):57–61, 1991.
[44] Kee Hyun Park and Lawrence W. Dowdy. Dynamic partitioning of multiprocessor systems. International Journal of Parallel Processing, 18(2):91–120, 1989.
[45] Michael J. Quinn and Year Back Yoo. Data structures for the efficient solution of graph theoretic problems on tightly-coupled MIMD computers. In Proceedings of the 1984 International Conference on Parallel Processing, pages 431–438, 1984.
[46] S. Ranka and S. Sahni. Hypercube Algorithms for Image Processing and Pattern Recognition. Springer-Verlag, New York, NY, 1990.
[47] V. N. Rao and Vipin Kumar. Parallel depth-first search, part I: Implementation. International Journal of Parallel Programming, 16(6):479–499, 1987.
[48] Vineet Singh, Vipin Kumar, Gul Agha, and Chris Tomlinson. Scalability of parallel sorting on mesh multicomputers. International Journal of Parallel Programming, 20(2), 1991.
[49] Xian-He Sun and John L. Gustafson. Toward a better parallel performance metric. Parallel Computing, 17:1093–1109, December 1991. Also available as Technical Report IS-5053, UC-32, Ames Laboratory, Iowa State University, Ames, IA.
[50] Xian-He Sun and L. M. Ni. Another view of parallel speedup. In Supercomputing '90 Proceedings, pages 324–333, 1990.
[51] Xian-He Sun and Diane Thiede Rover. Scalability of parallel algorithm-machine combinations. Technical Report IS-5057, Ames Laboratory, Iowa State University, Ames, IA, 1991. To appear in IEEE Transactions on Parallel and Distributed Systems.
[52] Zhimin Tang and Guo-Jie Li. Optimal granularity of grid iteration problems. In Proceedings of the 1990 International Conference on Parallel Processing, pages I111–I118, 1990.
[53] Fredric A. Van-Catledge. Towards a general model for evaluating the relative performance of computer systems. International Journal of Supercomputer Applications, 3(2):100–108, 1989.
[54] Jeffrey Scott Vitter and Roger A. Simons. New classes for parallel complexity: A study of unification and other complete problems for P. IEEE Transactions on Computers, May 1986.
[55] Jinwoon Woo and Sartaj Sahni. Hypercube computing: Connected components. Journal of Supercomputing, 1991. Also available as Technical Report TR 88-50 from the Department of Computer Science, University of Minnesota, Minneapolis, MN.
[56] Jinwoon Woo and Sartaj Sahni. Computing biconnected components on a hypercube. Journal of Supercomputing, June 1991. Also available as Technical Report TR 89-7 from the Department of Computer Science, University of Minnesota, Minneapolis, MN.
[57] Patrick H. Worley. Information Requirements and the Implications for Parallel Computation. PhD thesis, Stanford University, Department of Computer Science, Palo Alto, CA, 1988.
[58] Patrick H. Worley. The effect of time constraints on scaled speedup. SIAM Journal on Scientific and Statistical Computing, 11(5):838–858, 1990.
[59] Patrick H. Worley. Limits on parallelism in the numerical solution of linear PDEs. SIAM Journal on Scientific and Statistical Computing, 12:1–35, January 1991.
[60] Xiaofeng Zhou. Bridging the gap between Amdahl's law and Sandia laboratory's result. Communications of the ACM, 32(8):1014–1015, 1989.
[61] J. R. Zorbas, D. J. Reble, and R. E. VanKooten. Measuring the scalability of parallel computer systems. In Supercomputing '89 Proceedings, pages 832–841, 1989.