A Bridging Model for Multi-core Computing
Leslie G. Valiant∗
School of Engineering and Applied Sciences, Harvard University
Abstract
Writing software for one parallel system is a feasible though arduous task.
Reusing the substantial intellectual effort so expended for programming a
second system has proved much more challenging. In sequential computing, algorithms textbooks and portable software are resources that enable software systems to be written that are efficiently portable across changing
hardware platforms. These resources are currently lacking in the area of
multi-core architectures, where a programmer seeking high performance has
no comparable opportunity to build on the intellectual efforts of others.
In order to address this problem we propose a bridging model aimed at
capturing the most basic resource parameters of multi-core architectures. We
suggest that the considerable intellectual effort needed for designing efficient
algorithms for such architectures may be most fruitfully expended in design-
ing portable algorithms, once and for all, for such a bridging model. Portable
algorithms would contain efficient designs for all reasonable combinations of
the basic resource parameters and input sizes, and would form the basis for
implementation or compilation for particular machines.
Our Multi-BSP model is a multi-level model that has explicit parameters
for processor numbers, memory/cache sizes, communication costs, and syn-
chronization costs. The lowest level corresponds to shared memory or the
PRAM, acknowledging the relevance of that model for whatever limitations on memory and processor numbers it may be efficacious to emulate.
We propose parameter-aware portable algorithms that run efficiently on all such machines.
∗This work was supported in part by NSF-CCF-04-27129. A preliminary version appeared in Proc. 16th Annual European Symposium on Algorithms, Sept 15-17, 2008, Karlsruhe, Germany, LNCS Vol. 5193, pp. 13-28, and had also been presented at SPAA, June 14-16, 2008.
Email address: [email protected] (Leslie G. Valiant)
1. Introduction
Multi-core architectures, based on many processors and associated local
caches or memories, are attractive devices given current technological pos-
sibilities, and known physical limitations. However, the designer of parallel
algorithms for such machines has to face so many challenges that the road to
their efficient exploitation is not clearly signposted. Among these challenges
consider the following: First, the underlying computational substrate is much
more intricate than it is for conventional sequential computing and hence the
design effort is much more onerous. Second, the resulting algorithms have to
compete with and outperform existing sequential algorithms that are often
much better understood and highly optimized. Third, the ultimate reward
of all this effort is limited, at best a speedup of a constant factor, essen-
tially the number of processors. Fourth, machines differ, and therefore any
speedups obtained for one machine may not translate to speedups on oth-
ers, so that all the design effort may be substantially wasted. For all these
reasons it is problematic how or whether efficient multi-core algorithms will
be created and exploited in the foreseeable future. It is possible that even
as these machines proliferate, their computational potential will be greatly
underutilized.
We have argued previously that the general problem of parallel computing
should be approached via two notions [39, 18]. The first is the notion of portable parallel algorithms: algorithms that are parameter-aware and run
efficiently on machines with the widest range of values of these parameters.
We suggest that the creation of such algorithms needs to be adopted as an
important goal. The second notion is that such portable algorithms have to
be expressed in terms of a bridging model, one that bridges in a performance-
faithful manner what the hardware executes and what is in the mind of the
software writer. In particular the algorithms need to be written in a language
that can be compiled efficiently on to the bridging model. It is this bridging
model that defines the necessary performance parameters for the parameter-
aware software, and is a prerequisite for portable parallel algorithms to be
possible.
The originally proposed bridging model for parallel computation was the
BSP model [39]. Its main features are that: (i) it is a computational model
in that it is unambiguous, (ii) it incorporates numerical parameters that are
intended to reflect ultimately unevadable physical constraints, and (iii) it
has barrier synchronization as its one nonlocal primitive, an operation that
is powerful yet relatively easy to realize. We note that in such a parametrized
model some alternative choices can also be made, and a variety of these have
been explored in some detail [12, 13, 7, 6, 14].
In this paper we introduce the Multi-BSP model which extends BSP in
two ways. First, it is a hierarchical model, with an arbitrary number of
levels. It recognizes the physical realities of multiple memory and cache
levels both within single chips as well as in multi-chip architectures. The
aim is to model all levels of an architecture together, even possibly for whole
datacenters. Second, at each level, Multi-BSP incorporates memory size as
a further parameter. After all, it is the physical limitation on the amount
of memory that can be accessed in a fixed amount of time from the physical
location of a processor that creates the need for multiple levels.
The Multi-BSP model for depth $d$ will be specified by $4d$ numerical parameters $(p_1, g_1, L_1, m_1)(p_2, g_2, L_2, m_2)(p_3, g_3, L_3, m_3)\cdots(p_d, g_d, L_d, m_d)$. It is
a depth d tree with memories/caches at the internal nodes and processors at
the leaves. At each level the four parameters quantify, respectively, the num-
ber of subcomponents, the communication bandwidth, the synchronization
cost, and the memory/cache size.
It may be thought that proliferating numerical parameters only further
exponentiates the difficulty of designing parallel algorithms. The main obser-
vation of this paper is that this is not necessarily the case. In particular we
show, by means mostly of well-known ideas, that for problems such as stan-
dard matrix multiplication, Fast Fourier Transform and comparison sorting,
algorithms can be written that are optimal in a parameter-free sense, even
in the presence of this plethora of parameters. Our purpose is to persuade
that it is feasible and beneficial to write down the best algorithmic ideas we
have in a standardized form that is compilable to run efficiently on arbitrary
machines and guaranteed to be optimal in a specifiable sense.
In order to elucidate this striking phenomenon, we shall define a parameter-
free notion of an optimal Multi-BSP algorithm with respect to a given algo-
rithm A to mean the following: (i) It is optimal in parallel computation steps
to multiplicative constant factors and in total computation steps to within
additive lower order terms, (ii) it is optimal in parallel communication costs
to constant multiplicative factors among Multi-BSP algorithms, and (iii) it
is optimal in synchronization costs to within constant multiplicative factors
among Multi-BSP algorithms. (We note that some of the algorithms we
describe can be made optimal to within additive lower order terms also in
parallel computation steps, but this is a complication that we do not pursue
here. We also note that the actual lower bounds we shall derive for (ii) hold
for more general distributed models also.)
Insisting on optimality to a factor of one in total computation time is
a significant requirement and imposes a useful discipline, we believe. We
tolerate multiplicative constant factors $k_{\mathrm{comp}}$, $k_{\mathrm{comm}}$, and $k_{\mathrm{synch}}$ in the other
three measures, but for each depth d we insist that these be independent of
the p, g, L and m parameters. Identifying the best achievable combinations
of these constant factors can be challenging if all combinations of parameters
are to be anticipated, and is left for future research. The ultimate goal is
$(k_{\mathrm{comp}}, k_{\mathrm{comm}}, k_{\mathrm{synch}})$-optimal algorithms where the $k$'s are close to one, but in this paper we will be satisfied with $(O(1), O(1), O(1))$-optimality.
There have existed several previous models that have substantial com-
monality with Multi-BSP. Using memory size as a fourth BSP parameter
was proposed and investigated by Tiskin [38] and by McColl and Tiskin [29].
In a different direction a variety of hierarchical versions of BSP have been
proposed such as the D-BSP of de la Torre and Kruskal [13], which has been
further investigated by Bilardi et al. [6, 8]. In [8] a network-oblivious result is
proved in this communication hierarchical context that, like our analyses, al-
lows for arbitrary parameters at each level. The D-BSP captures hierarchies
in communication while Multi-BSP seeks additionally to capture hierarchies
in the cache/memory system.
In a different direction, numerous models have been proposed for studying
varieties of memory or cache hierarchies both in the sequential [1] and in the
parallel [42] contexts. In Alpern et al. [4] a tree structure of memories akin
to ours is defined. For such hierarchical models in both the sequential and
parallel contexts authors have generally taken some uniform cost view of the
various levels, rather than analyzing the effect of arbitrary parameters at
each level. Savage [33, 34] has analyzed the communication requirements of
a hierarchical memory model, with arbitrary parameters at each level, using
a generalization of the Hong-Kung [23] pebble model.
Recently, also motivated by multi-core machines, a multi-level cache model
has been proposed and analyzed by Blelloch et al. [9] and Chowdhury and
Ramachandran [11]. In contrast with Multi-BSP, this model has a parame-
ter for cache block size, but is less explicit on the issue of synchronization.
The analyses published for this model have been for two levels with arbitrary
parameters, but extensions are possible for arbitrary numbers of levels (Ra-
machandran [31]). Even more recently a multi-core model has been proposed
by Savage and Zubair [35] that provides a precise model of the communica-
tion requirements in such architectures. There they independently give lower
bounds for the communication requirements of FFT and matrix multiplica-
tion similar to these parts of our Theorem 1.
We emphasize that, in comparison with the previous literature, our goal
here is that of finding a bridging model that isolates the most fundamental
issues of multi-core computing and allows them to be usefully studied in
some detail. The Multi-BSP model reflects the view that fundamentally
there are just two unevadable sources of increasing cost that the physical
world imposes at increasing distances: (i) a cost g for bandwidth, and (ii)
a cost L related to latency that must be charged for synchronization and
for messages that are too short. The model is a comprehensive model of
computation in that it has mechanisms for synchronization as well as for
computation and communication. The suggestion is that these unevadable
costs already capture enough complications that we would be best advised
to understand algorithmic issues in this bridging framework first.
The goal here is to identify a bridging model on which the community can
agree, one which would influence the design of both software and hardware.
It will always be possible to have performance models that reflect a particular
architecture in greater detail than does any bridging model, but such models
are not among our goals here.
There are many issues relevant to multi-core computing that we do not
explore here. These include the use of multi-core for executing independent
tasks, code automatically compiled from sequential code, and code compiled
from languages in which parallelism is expressed but not scheduled. This
paper is predicated on the idea that there will be a demand for exploiting
multi-core architectures beyond what is possible by these means. A further
issue not discussed here is the role of non-homogeneous cores [22, 28].
The main commonality between our algorithmic results and previous lit-
erature is the observation that certain recursive algorithms are well suited to
computational models with multiple parameters. We push this observation
further by allowing an arbitrary number of arbitrary parameters. Program-
ming even simple recursive algorithms to run efficiently for all input sizes
even on one machine can be an onerous task. Our finding is that for certain
important problems, the use of a bridging model enables one to make one
big effort, once and for all, to write a program that is efficient for all inputs
and all machines.
[Figure: a level i component, consisting of p_i level (i−1) components together with the level i memory m_i; data rate g_i to the level above, synchronization cost L_i.]
[Figure: a level 1 component, consisting of p_1 processors accessing the level 1 memory m_1 at data rate 1; data rate g_1 to the level above.]
The number of level $i$ components in a level $j$ component will be $Q_{i,j} = p_{i+1} \cdots p_j$, and the number in the whole system will be $Q_{i,d} = Q_i = p_{i+1} \cdots p_d$. The total memory available within a level $i$ component will be $M_i = m_i + p_i m_{i-1} + p_{i-1} p_i m_{i-2} + \cdots + p_2 \cdots p_{i-1} p_i m_1$. The gap or bandwidth parameter that characterizes the cost of communication from level 1 to outside level $i$ is $G_i = g_i + g_{i-1} + g_{i-2} + \cdots + g_1$.
Since the intention is to model the entire system, defining as many levels as necessary, we assume by convention that $Q_d = 1$ and that $g_d$ is infinite. In other words the total machine is regarded as a single component, and includes any “external storage” to which communication is to be analyzed. For the same reason it is assumed that for any problem instance of size $n$ and an algorithm for it, the level $d$ memory is in fact sufficient to support the computation, and certainly $m_d \geq n$. In the applications in this paper $m_d = O(n)$ is sufficient.
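As an aside, the bookkeeping for these derived quantities is straightforward to mechanize. The following sketch is our own illustration only; the class name MultiBSP and the sample parameter values are hypothetical and not part of the model's definition. It computes $Q_{i,j}$, $M_i$ and $G_i$ from a list of per-level parameters.

```python
from math import prod, inf

class MultiBSP:
    """A depth-d Multi-BSP machine given as a list of (p, g, L, m) tuples,
    level 1 first.  Illustrative parameter bookkeeping only."""

    def __init__(self, levels):
        self.levels = levels
        self.d = len(levels)

    def p(self, i): return self.levels[i - 1][0]
    def g(self, i): return self.levels[i - 1][1]
    def L(self, i): return self.levels[i - 1][2]
    def m(self, i): return self.levels[i - 1][3]

    def Q(self, i, j=None):
        """Q_{i,j} = p_{i+1} ... p_j: number of level-i components in one
        level-j component; j defaults to d, giving Q_i."""
        j = self.d if j is None else j
        return prod(self.p(k) for k in range(i + 1, j + 1))

    def M(self, i):
        """M_i = m_i + p_i m_{i-1} + p_{i-1} p_i m_{i-2} + ...:
        total memory within a level-i component."""
        return sum(self.m(k) * self.Q(k, i) for k in range(1, i + 1))

    def G(self, i):
        """G_i = g_1 + g_2 + ... + g_i: cost of communication from
        level 1 to outside level i."""
        return sum(self.g(k) for k in range(1, i + 1))

# Hypothetical depth-3 machine (made-up numbers, for illustration only).
machine = MultiBSP([(4, 1, 2, 32 * 1024),
                    (8, 3, 25, 4 * 1024 * 1024),
                    (2, inf, 1000, 1 << 30)])
print(machine.Q(1), machine.M(2), machine.G(2))
```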
We make the assumption that for all $i$
$m_i \geq m_{i-1}$,   (1)
$m_i \geq M_i/i$.   (2)
Both of these hold for conventional cache hierarchies where any word at one
level has distinct copies at every higher level. (We note that in the treatment
here we do not otherwise distinguish between memory and cache.) Finally, we
note that a major complication in the analysis of our algorithms arises from
allowing arbitrary variations in the successive mi . For our sorting algorithm
it will be convenient to impose the mild constraint that for some constant k
for all i
The BSP model [39] with parameters $(p, g, L)$ where the basic unit has memory $m$ would be modeled with $d = 2$ and $(p_1 = 1, g_1 = g, L_1 = 0, m_1 = m)(p_2 = p, g_2 = \infty, L_2 = L, m_2)$. The difference is that in the basic BSP model direct communication is allowed horizontally between units at the same level, while in Multi-BSP such communication would need to be simulated via memory at a higher level. This case $(p_1 = 1, g_1 = g, L_1 = 0, m_1 = m)(p_2 = p, g_2 = \infty, L_2 = L, m_2)$ corresponds to the BSPRAM model of Tiskin [38].
In general, we regard $m = \min\{m_i \mid 1 \leq i \leq d\}$ and the input size $n$ to be large numbers. In relating resource bounds $F_1, F_2$ expressed in terms of the parameters $\{p_i, g_i, L_i, m_i \mid 1 \leq i \leq d\}$ and the input size $n$, we shall define the relation $F_1 \lesssim F_2$ to mean that for all $\varepsilon > 0$, $F_1 < (1 + \varepsilon)F_2$ for all sufficiently large values of $m$ and $n$. This enables expressions such as $(1 + 1/m_i)$, $(1 + 1/m_i^{1/2})$ or $(1 + 1/\log m_i)$ to be approximately upper bounded by 1.
Also, we define $F_1 \lesssim_d F_2$ to mean that for some constant $c_d$ depending possibly on $d$ but not on $n$ or on any of the parameters $\{p_i, g_i, L_i, m_i \mid 1 \leq i \leq d\}$ it is the case that $F_1 < c_d F_2$ for all sufficiently large values of $n$ and $m$.
Because we can suppress constant multipliers with these notations, we
shall sometimes identify a parameter such as mj , for example, with a fixed
multiple of itself.
For a Multi-BSP algorithm A∗ we shall define Comp(A∗ ), Comm(A∗ ),
and Synch(A∗ ) to be the parallel costs of computation, communication, and
synchronization respectively on a Multi-BSP machine H in the sense that for
any computation of A∗ on H and along any single critical path in it, at most
Comp(A∗ ) processor operations have been executed and at most Comm(A∗ )
communication charge and at most Synch(A∗ ) synchronization charge has
been incurred. (For randomized algorithms the same claim holds with high
probability.) Note that all three charges are expressed in terms of the basic
unit of time taken by a processor to perform one operation.
To quantify the efficiency of A∗ we specify a baseline algorithm A of which
A∗ is the Multi-BSP implementation and for that:
(i) $\mathrm{Comp}_{seq}(A)$ is the total number of computational operations of $A$. $\mathrm{Comp}(A)$ is $\mathrm{Comp}_{seq}(A)/P_d$ where $P_d$ is the total number of processors in $H$.
(ii) Comm(A) is the minimal communication cost of any Multi-BSP imple-
mentation of A on H.
(iii) Synch(A) is the minimal synchronization cost of any Multi-BSP imple-
mentation of A on H.
A Multi-BSP algorithm $A^*$ is optimal with respect to algorithm $A$ if
(i) $\mathrm{Comp}(A^*) \lesssim_d \mathrm{Comp}(A)$, and $\mathrm{Comp}_{seq}(A^*) \lesssim \mathrm{Comp}_{seq}(A)$,
(ii) $\mathrm{Comm}(A^*) \lesssim_d \mathrm{Comm}(A)$, and
(iii) $\mathrm{Synch}(A^*) \lesssim_d \mathrm{Synch}(A)$.
The philosophy of the above definitions is the following: First, we are
not specific about the class of algorithms to which A belongs, as long as the
specified costs can be defined. For example, we variously allow A to be any
implementation of the standard matrix multiplication algorithm, or any com-
parison algorithm for sorting. Second, allowing at each level some efficiency
loss in communication and synchronization is tolerable for problems for which
computational costs dominate asymptotically. It frees the analysis of several
concerns, such as whether the input size is an exact multiple of the memory
sizes. Analogous to the role of the polynomial time criterion in sequential
computing, we believe that freeing the algorithm designer from the tedium
of certain well-chosen optimality criteria will encourage the development of
practical algorithms. In this instance we permit constant factor inefficiencies
in the parallel computation, communication and synchronization costs. How-
ever, for all machines of the same depth d these constant factors are bounded
by the same constants independent of all the other machine parameters. In
all three measures additive lower order terms that have a vanishing relative
contribution as the input size n and m = min{mi |1 ≤ i ≤ d} grow, are also
allowed.
external storage device that is large enough to store the input to the problem
at hand. The parameters of the chip according to one interpretation of the
specifications and modulo the serious qualifications listed below, would be
the following:
Level 1: 1 core has 1 processor with 4 threads plus L1 cache: $(p_1 = 4, g_1 = 1, L^*_1 = 3, m_1 = 8\mathrm{KB})$.
Level 2: 1 chip has 8 cores plus L2 cache: $(p_2 = 8, g_2 = 3, L^*_2 = 23, m_2 = 3\mathrm{MB})$.
Level 3: $p$ multi-core chips with external memory $m_3$ accessible via a network at rate $g_2$: $(p_3 = p, g_3 = \infty, L^*_3 = 108, m_3 \leq 128\mathrm{GB})$.
Now the qualifications include the following: First, the $L^*$-parameters
listed are certain latency parameters given in the chip specifications, rather
than the actual synchronization costs. Second, the caches on the chip are
controlled by implicit protocols, rather than explicitly by the programs as is
customary for memories. Third, in the actual chip each physical processor
runs four threads, and groups of processors share a common arithmetic unit.
For all these reasons, while the relative magnitudes of the various g and L∗
values shown may be meaningful, their absolute values for this instance are
harder to pin down.
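Purely as an illustration of how such an interpretation might be recorded, the sketch below lists the levels above as parameter tuples; the numbers are those just given, subject to the qualifications just stated, and the chip count p = 2 is an arbitrary assumption of ours.

```python
from math import inf, prod

# Parameters from the interpretation above (the L* entries are latency figures
# from the chip specification, not measured synchronization costs); p = 2 chips
# is an arbitrary assumption for concreteness.
p_chips = 2
levels = [
    (4, 1, 3, 8 * 1024),              # level 1: core (4 threads) plus L1 cache
    (8, 3, 23, 3 * 1024 * 1024),      # level 2: chip (8 cores) plus L2 cache
    (p_chips, inf, 108, 128 * 2**30), # level 3: p chips plus external memory
]

total_processors = prod(p for p, _, _, _ in levels)  # p1 * p2 * p3
print(total_processors)                              # 4 * 8 * 2 = 64
```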
We have three main observations about the requirements on architectures
to support Multi-BSP.
The first is that barrier synchronization needs to be supported efficiently.
There are reports in the literature suggesting that this can be done already
with current architectures [32].
The second is that the model controls the storage explicitly. It is not
clear how the various cache protocols currently used are supportive of the
model.
The third is that we only need that machines support the features of
this model – no constraint is implied about what else they might support.
Thus there is no objection to machines being able to switch off the barrier
synchronization mechanism, or having as additional mechanisms some of
those that have been advocated as alternatives in other related models. Our
proposal here concerns the bridging model – a minimal description on which
architects and algorithm designers can come together.
4. Work-limited Algorithms
Our proofs of optimality for communication and synchronization given
in this section and the one to follow all derive from lower bounds on the
number of communication steps required in distributed algorithms and are
direct applications of previous work, particularly of Hong and Kung [23],
Aggarwal and Vitter [3], Savage [33, 34] and Irony, Toledo and Tiskin [25].
Definition A straight-line program A is w(S)-limited if every subset of its
operations that uses at most S inputs (i.e. excluding data items generated
by the subset) and produces at most S outputs (i.e. items of data that are
used by operations of the program outside the subset, or are the outputs of
the program) consists of no more than w(S) operations.
Note that this limitation to S items imposes the twin constraints that at
most S items can be used as data by the subset of the operations, and at
most S items can be used to pass values computed by that subset to later
computation steps. (As a minor comment we note also that our definition
is slightly less onerous than those of red/blue pebbling and span [23, 33],
which, in addition, restrict to S the space that can be used internally by the
algorithm fragment.)
We first consider the task of associative composition AC(n): Given a
linear array A of n elements from a set X, an associative binary operation ⊗
on X, and a specific set of disjoint contiguous subarrays of A, the object is
to compute the composition of each subarray under ⊗ in some order, where
the only operation allowed on elements of X is ⊗.
Proposition 4.1. For any n and S, any algorithm for associative composi-
tion AC(n) is (S − 1)-limited.
Proposition 4.2. For any $n$ and $S$, the standard matrix multiplication algorithm MM($n \times n$) is $S^{3/2}$-limited.
Proof This is observed by Irony, Toledo and Tiskin [25, Lemma 2.2] by applying the Loomis-Whitney inequality to matrix multiplication as originally suggested by M. Paterson, and is also implicit in [23, 33]. □
Proposition 4.3. For any $n$ and $S$, the standard Fast Fourier transform algorithm FFT($n$) is $2S \log_2 S$-limited.
Proof This has been shown by Aggarwal and Vitter [3, Lemma 6.1]. □
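For use in the bounds of the next section, the three work limits just stated can be recorded directly; the sketch below consists of our own helper functions, purely illustrative.

```python
import math

def w_AC(S):
    """Associative composition (Proposition 4.1): at most S - 1 operations."""
    return S - 1

def w_MM(S):
    """Standard matrix multiplication (Proposition 4.2): at most S^(3/2) operations."""
    return S ** 1.5

def w_FFT(S):
    """Standard FFT (Proposition 4.3): at most 2 S log2(S) operations."""
    return 2 * S * math.log2(S)
```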
5. Lower Bounds
We derive our lower bound results for straight-line programs using the approach of Irony, Toledo and Tiskin [25] (and also of [23, 33]), while the result for sorting uses an adversarial argument of Aggarwal and Vitter [3].
The bounds will be stated for Multi-BSP but the lower bound arguments for
communication hold more generally, for all distributed algorithms with the
same hierarchy of memory sizes and costs of communication, even if there is
no bulk synchronization.
Lemma 5.1 Suppose $W$ computation steps are executed of a $w(S)$-limited straight-line program on a Multi-BSP machine. Then for any $j$ the total number of words transmitted between level $j$ components and the level $j+1$ components to which they belong is at least
$M_j(W/w(2M_j) - Q_j)$,   (4)
and the total number of level $j$ component supersteps is at least
$W/w(M_j)$.   (5)
Proof For each level j component divide the computation into phases,
where each phase ends when the total number of messages sent to or received
from level j + 1 reaches Mj . In each phase therefore at most 2Mj words are
available, including those residing in memory before the start of the phase.
Then at most w(2Mj ) operations can be performed by any execution sequence
in one phase. It follows that the total number of such component phases is at
least W/w(2Mj ). Further, each of these component phases, except possibly
the last one for each component, must complete, and involve a movement
of Mj data for each of the Qj such components. Hence the total amount of
data movement between level j and level j + 1 is at least as claimed in (4).
By the same argument, since in a component superstep at most Mj mem-
ory is available, at most w(Mj ) operations can be performed, and hence at
least $W/w(M_j)$ component supersteps are needed, which gives (5). □
Theorem 5.1.
$\mathrm{Comm}(n, d) \gtrsim_d \sum_{i=1}^{d-1} (W(n)/(Q_i\, w(2M_i)) - 1)\, M_i g_i$,   (6)
$\mathrm{Synch}(n, d) \gtrsim_d \sum_{i=1}^{d-1} W(n)\, L_{i+1}/(Q_i\, w(M_i))$.   (7)
Proof This follows from Lemma 5.1 by adding the costs over all the levels.
Consider the Q1 paths from the level 1 components to the level d component
in the tree hierarchy as potential critical paths of the executions. The average
load on these, and hence the worst case also, is as claimed in (6) and (7). □
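To illustrate how the sums in (6) and (7) aggregate over the levels, the following sketch evaluates the two right-hand sides numerically. It is hypothetical helper code of ours, taking a parameter list, a total operation count W(n), and a work limit w such as those above.

```python
from math import prod

def _Q_and_M(levels, i):
    """Q_i = p_{i+1} ... p_d and M_i for a (p, g, L, m) parameter list."""
    p = [lv[0] for lv in levels]
    m = [lv[3] for lv in levels]
    Q_i = prod(p[i:])
    M_i = sum(m[k - 1] * prod(p[k:i]) for k in range(1, i + 1))
    return Q_i, M_i

def comm_lower_bound(levels, W, w):
    """Right-hand side of (6): sum over i = 1..d-1 of
    (W / (Q_i * w(2*M_i)) - 1) * M_i * g_i.  Illustrative sketch only."""
    d = len(levels)
    total = 0.0
    for i in range(1, d):
        Q_i, M_i = _Q_and_M(levels, i)
        g_i = levels[i - 1][1]
        total += (W / (Q_i * w(2 * M_i)) - 1) * M_i * g_i
    return total

def synch_lower_bound(levels, W, w):
    """Right-hand side of (7): sum over i = 1..d-1 of
    W * L_{i+1} / (Q_i * w(M_i)).  Illustrative sketch only."""
    d = len(levels)
    total = 0.0
    for i in range(1, d):
        Q_i, M_i = _Q_and_M(levels, i)
        L_next = levels[i][2]          # this is L_{i+1}
        total += W * L_next / (Q_i * w(M_i))
    return total
```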
Corollary 5.1.
$\mathrm{AC\text{-}Comm}(n, d) \gtrsim_d \sum_{i=1}^{d-1} (n/(M_i Q_i) - 1)\, M_i g_i$,   (8)
$\mathrm{AC\text{-}Synch}(n, d) \gtrsim_d \sum_{i=1}^{d-1} n L_{i+1}/(Q_i M_i)$,   (9)
$\mathrm{MM\text{-}Comm}(n \times n, d) \gtrsim_d \sum_{i=1}^{d-1} (n^3/(Q_i M_i^{3/2}) - 1)\, M_i g_i$,   (10)
$\mathrm{MM\text{-}Synch}(n \times n, d) \gtrsim_d \sum_{i=1}^{d-1} n^3 L_{i+1}/(Q_i M_i^{3/2})$,   (11)
$\mathrm{FFT\text{-}Comm}(n, d) \gtrsim_d \sum_{i=1}^{d-1} (n \log n/(Q_i M_i \log M_i) - 1)\, M_i g_i$,   (12)
$\mathrm{FFT\text{-}Synch}(n, d) \gtrsim_d \sum_{i=1}^{d-1} n \log n\, L_{i+1}/(Q_i M_i \log M_i)$,   (13)
$\mathrm{Sort\text{-}Comm}(n, d) \gtrsim_d \sum_{i=1}^{d-1} (n \log n/(Q_i M_i \log M_i) - 1)\, M_i g_i$,   (14)
$\mathrm{Sort\text{-}Synch}(n, d) \gtrsim_d \sum_{i=1}^{d-1} n \log n\, L_{i+1}/(Q_i M_i \log M_i)$.   (15)
Proof Applying Theorem 5.1 directly gives the first six inequalities.
The bounds for sorting follow from the following adversarial argument adapted from Aggarwal and Vitter [3]: Let $S = M_i$. Consider any total ordering of all the level $i$ component supersteps that respects all the time dependencies. As we go through these component supersteps in that order we shall adversarially define a sequence of successively stricter partial orders on the input set $X$ as may be discovered by the comparisons made in the successive component supersteps. In each such component superstep that inputs a set $Y$ of $r$ elements of $X$ we shall identify a total ordering on $Y \cup Z$, where $Z$ is the subset of $X$ that was present in the component at the start of that component superstep. The total order identified for $Y \cup Z$ will be one that maximizes the number of partial orders on $X$ still consistent with both this total order as well as the total orders found on subsets in previous stages of this adversarial process. Since the total order on $Z$ was found in an earlier component superstep, the number of choices available is no more than $S^r$, which upper bounds the number of ways $Y$ can be inserted into a sorted $Z$. Hence the logarithm to the base $S$ of the number of partial orders on $X$ still consistent after this component superstep is reduced by at most $r$, at the cost of $r$ inputs, if we pick the total order on $Y \cup Z$ that is consistent with the most partial orders on $X$. Since this logarithm initially is $n \log_S n$, the total number of inputs must be at least this number, minus the total number of elements inside the level $i$ components at the start, which is $S Q_i$. This establishes the lower bound on communication. For the final synchronization bound we observe that any component superstep can depend on at most $S$ input elements. □
6. Optimal Algorithms
We shall describe algorithms that meet the lower bounds on communica-
tion and synchronization given in the previous section but with Mj replaced
by mj . Under the assumption that (2) holds, Mj can be replaced by mj in
lower bounds and hence our upper and lower bounds then match. For each
of the algorithms described it is easy to verify that the conditions $\mathrm{Comp}(A^*) \lesssim_d \mathrm{Comp}(A)$ and $\mathrm{Comp}_{seq}(A^*) \lesssim \mathrm{Comp}_{seq}(A)$ on computation steps are satisfied, and we shall not comment on them further.
A level j component superstep will refer to what a single level j component
performs in a level j superstep. For simplicity of description we assume
throughout that every fraction referenced, such as mj /mj−1 , has integral
value.
$m_j/m_{j-1}$ results back, it performs up to $m_j/m_{j-1}$ further pairwise $\otimes$ operations recursively so as to derive the composition of the subsequences $Z$ assigned to it.
The costs of the recursion at one level $j$ component can be divided into (i) the data movement steps between the level $j$ component and its level $j-1$ subcomponents, and (ii) the recursive computation of the $m_j/m_{j-1}$ further pairwise $\otimes$ operations. For (i), since in the overall computation a level $j-1$ memory has to be filled with information from level $j$ (and one word returned) at most $m_j/m_{j-1}$ times, the parallel cost of communication at this level is at most
$\lesssim m_j g_{j-1}/p_j$,
and the parallel cost of synchronization at most
$\lesssim m_j L_j/(p_j m_{j-1})$.
In addition $m_j/(p_j m_{j-1})$ component supersteps of AC$(m_{j-1}, j-1)$ will need to be executed.
For (ii) we observe that the cost corresponds to the original problem for a level $j$ component superstep, but for input length $m_j/m_{j-1}$ rather than $m_j$. In other words its costs are AC-Comp$(m_j/m_{j-1}, j)$, AC-Comm$(m_j/m_{j-1}, j)$ and AC-Synch$(m_j/m_{j-1}, j)$.
Hence, for the overall algorithm
$\mathrm{AC\text{-}Comm}(m_j, j) \lesssim m_j g_{j-1}/p_j + (m_j/(p_j m_{j-1}))\,\mathrm{AC\text{-}Comm}(m_{j-1}, j-1) + \mathrm{AC\text{-}Comm}(m_j/m_{j-1}, j)$
and
$\mathrm{AC\text{-}Synch}(m_j, j) \lesssim m_j L_j/(p_j m_{j-1}) + (m_j/(p_j m_{j-1}))\,\mathrm{AC\text{-}Synch}(m_{j-1}, j-1) + \mathrm{AC\text{-}Synch}(m_j/m_{j-1}, j)$.
Consider the first of these equations. Note that this expression for AC-
Comm(mj , j) could have been derived and holds for any m ≤ mj . We can
therefore expand this expression recursively by substituting it repeatedly for
the last term AC-Comm(mj /mj−1 , j), first with m = mj /mj−1 , then with
m = mj /(mj−1 )2 , etc., to obtain
$\mathrm{AC\text{-}Comm}(m_j, j)$
$\lesssim (m_j g_{j-1}/p_j + (m_j/(p_j m_{j-1}))\,\mathrm{AC\text{-}Comm}(m_{j-1}, j-1))(1 + 1/m_{j-1} + 1/m_{j-1}^2 + \cdots)$
$\lesssim_d (m_j g_{j-1}/p_j + (m_j/(p_j m_{j-1}))\,\mathrm{AC\text{-}Comm}(m_{j-1}, j-1))$.
Expanding over the successive levels then gives, for $n \leq m_d$,
$\mathrm{AC\text{-}Comm}(n, d) \lesssim_d \sum_{i=1}^{d-1} n g_i/Q_i$,   (16)
$\mathrm{AC\text{-}Synch}(n, d) \lesssim_d \sum_{i=1}^{d-1} n L_{i+1}/(Q_i m_i)$.   (17)
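As a sanity check of how the recurrence above unwinds into these bounds, the following sketch evaluates AC-Comm and AC-Synch by following the recursion literally. It is entirely our own illustration: the base cases in particular are a simplification of ours, and integral ratios m_j/m_{j-1} are assumed as in the text.

```python
from functools import lru_cache
from math import inf

def ac_costs(levels):
    """Evaluate the AC-Comm and AC-Synch recurrences for a problem filling
    the level-d memory.  `levels` is a list of (p, g, L, m) tuples, level 1
    first.  Illustrative sketch; base cases are a simplification of ours."""
    p = [lv[0] for lv in levels]
    g = [lv[1] for lv in levels]
    L = [lv[2] for lv in levels]
    m = [lv[3] for lv in levels]

    @lru_cache(maxsize=None)
    def comm(n, j):
        # AC-Comm(n, j): communication for composing n elements at level j.
        if j == 1:
            return 0
        if n <= m[j - 2]:                  # fits in one level j-1 memory
            return n * g[j - 2] + comm(n, j - 1)
        fills = n // (p[j - 1] * m[j - 2])
        return (n * g[j - 2] // p[j - 1]           # n g_{j-1} / p_j
                + fills * comm(m[j - 2], j - 1)    # recursive compositions
                + comm(n // m[j - 2], j))          # compose partial results

    @lru_cache(maxsize=None)
    def synch(n, j):
        if j == 1:
            return 0
        if n <= m[j - 2]:
            return L[j - 1] + synch(n, j - 1)
        fills = n // (p[j - 1] * m[j - 2])
        return (fills * L[j - 1]                   # n L_j / (p_j m_{j-1})
                + fills * synch(m[j - 2], j - 1)
                + synch(n // m[j - 2], j))

    d = len(levels)
    return comm(m[d - 1], d), synch(m[d - 1], d)

# Hypothetical machine with nested power-of-two memories (illustration only).
print(ac_costs([(4, 1, 2, 1 << 10), (8, 3, 25, 1 << 16), (2, inf, 100, 1 << 22)]))
```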
In a level $j$ component superstep there will be $m_j^{3/2}/m_{j-1}^{3/2}$ level $j-1$ component supersteps of $m_{j-1}^{1/2} \times m_{j-1}^{1/2}$ matrix multiplications. In addition we will charge to this level the further $m_j(m_j^{1/2}/m_{j-1}^{1/2}) = m_j^{3/2}/m_{j-1}^{1/2}$ additions needed to combine the results of the $m_{j-1}^{1/2} \times m_{j-1}^{1/2}$ multiplications of the level $j-1$ local supersteps. For the latter operations we will use $m_j^{1/2}/m_{j-1}^{1/2}$ successive Associative Composition operations AC$(m_j, j)$ we analyzed earlier, each such operation performing compositions on sets of $m_j^{1/2} m_{j-1}^{1/2}$ subarrays each of size $m_j^{1/2}/m_{j-1}^{1/2}$. Hence using (16) and $Q_j \geq 1$, the total communication cost we charge that derives from this $j$th level of recursion is
since by (1) the mj are nondecreasing in j. Assuming (2) this meets the
lower bound (10).
Similarly, the total charge for synchronization at level j is
since by (1) the mj are nondecreasing in j. Assuming (2) this meets the
lower bound (11).
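A sequential sketch of the blocking structure that this charging argument follows may help: an s × s multiplication held at one level is decomposed into (s/b)^3 multiplications of b × b blocks, with b of order m_{j-1}^{1/2}, and the partial products are combined by associative additions. This is purely our own illustration of the standard algorithm's recursive structure, not the Multi-BSP scheduling itself.

```python
def blocked_matmul(A, B, block_sizes):
    """Multiply square matrices A and B (lists of lists) by recursive
    blocking.  block_sizes is a decreasing list of block widths, one per
    remaining level, e.g. [64, 8]; once it is exhausted the multiplication
    is done directly.  Illustrative sketch of the recursion's structure only."""
    n = len(A)
    if not block_sizes or block_sizes[0] >= n:
        # Base case: ordinary triple loop.
        return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]
    b = block_sizes[0]
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):
                # Multiply the b x b blocks of A and B one level further
                # down, then add the partial product into the output block.
                Ab = [row[k0:k0 + b] for row in A[i0:i0 + b]]
                Bb = [row[j0:j0 + b] for row in B[k0:k0 + b]]
                Cb = blocked_matmul(Ab, Bb, block_sizes[1:])
                for i in range(b):
                    for j in range(b):
                        C[i0 + i][j0 + j] += Cb[i][j]
    return C

# Tiny usage check (hypothetical sizes).
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert blocked_matmul(A, B, [1]) == [[19, 22], [43, 50]]
```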
6.3. Fast Fourier Transform
We consider the FFT problem for input size $N = 2^u$ as a straight-line program where each operation corresponds to a node at a layer $k \in \{0, 1, \cdots, u\}$, and an operation at layer $k$ is a linear combination of the values produced at its two antecedent nodes at layer $k-1$. The operation sequence can be represented as a directed acyclic graph with nodes $(i_1 i_2 \cdots i_u, k)$ where $i_j \in \{0, 1\}$ and $k \in \{0, 1, \cdots, u\}$, and edges $((i_1 i_2 \cdots i_u, k), (i_1 i_2 \cdots i_u, k+1))$ and $((i_1 i_2 \cdots i_u, k), (i_1 i_2 \cdots i^*_{k+1} \cdots i_u, k+1))$ where for $i \in \{0, 1\}$, $i^* \in \{0, 1\}$ is the complement, namely $i + i^* = 1 \bmod 2$. The feature of this network that we use is that once the first $\log_2 x$ layers of operations have been computed, the computation can be completed by doing $x$ independent subproblems on FFT graphs of size $N/x$.
Our basic algorithm FFT(mj , x, j) will compute x instances of FFTs on
disjoint sets of mj /x points all initially held in the memory of one level j
component with the output to be held also in that memory. By FFT we shall
mean here a computation on the FFT graph possibly with different constant
multipliers than in the actual FFT. Also, for simplicity of description we are
assuming below that the memory in a level j component is a fixed constant
multiple of mj , and as throughout, that all fractions referenced have integral
values. The algorithm consists of doing the following:
$+\ \mathrm{FFT\text{-}Comm}(m_j, x\,m_{j-1}, j)$,
and correspondingly for FFT-Synch.
Expanding the first of these through the $r = \log m_j/\log m_{j-1}$ levels of recursion gives that
$\mathrm{FFT\text{-}Comm}(m_j, x, j) = (\log m_j/\log m_{j-1})\,(m_j/(m_{j-1} p_j))\,[\,m_{j-1} g_{j-1} + \mathrm{FFT\text{-}Comm}(m_{j-1}, 1, j-1)\,]$.
Now
$\mathrm{FFT\text{-}Comm}(m_{j-1}, 1, j-1) \leq m_{j-1} \log m_{j-1} \sum_{i=1}^{j-2} (1/\log m_i)\, g_i/Q_{i,j-1}$,
and substituting in the above using $Q_{i,j-1}\, p_j = Q_{i,j}$ gives the following as an upper bound for $\mathrm{FFT\text{-}Comm}(m_j, x, j)$:
$(\log m_j/\log m_{j-1})\, g_{j-1} m_j/p_j + m_j \log m_j \sum_{i=1}^{j-2} (1/\log m_i)\, g_i/Q_{i,j} = m_j \log m_j \sum_{i=1}^{j-1} (1/\log m_i)\, g_i/Q_{i,j}$.
Then for $n \leq m_d$
$\mathrm{FFT\text{-}Comm}(n, 1, d) \lesssim_d \sum_{i=1}^{d-1} (n \log n)\, g_i/(Q_i \log m_i)$.   (20)
$\mathrm{FFT\text{-}Synch}(n, 1, d) \lesssim_d \sum_{i=1}^{d-1} (n \log n)\, L_{i+1}/(Q_i m_i \log m_i)$.   (21)
More simply, (21) follows from (20) by observing that for the parallel cost
of a level i superstep, while we charge mi gi in (20), the charge is Li+1 in (21).
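The structural property used above, namely that after the first log2(x) layers the graph splits into x independent subproblems of size N/x, can be exhibited concretely. The sketch below is our own illustration: a generic linear `combine` stands in for the true twiddle factors, in line with the text's convention, and the two evaluations it compares are equivalent by construction.

```python
def eval_fft_graph(values, combine, layers=None):
    """Evaluate the FFT graph on `values` (length N = 2**u): at layer k
    (1-based) the value at index i becomes
    combine(prev[i], prev[i ^ (1 << (k-1))], k, i).
    `layers` is the range of layers to apply (default: all u of them).
    Illustrative sketch only; `combine` stands in for the FFT operation."""
    N = len(values)
    u = N.bit_length() - 1
    layers = range(1, u + 1) if layers is None else layers
    vals = list(values)
    for k in layers:
        bit = 1 << (k - 1)
        vals = [combine(vals[i], vals[i ^ bit], k, i) for i in range(N)]
    return vals

def eval_fft_graph_split(values, combine, x):
    """Same result, computed as: the first log2(x) layers on all N points,
    then x independent subproblems of size N/x, one per group of indices
    agreeing on their low-order log2(x) bits."""
    N = len(values)
    logx = x.bit_length() - 1
    vals = eval_fft_graph(values, combine, range(1, logx + 1))
    out = [None] * N
    for prefix in range(x):
        idxs = [prefix + q * x for q in range(N // x)]
        sub = [vals[i] for i in idxs]
        # Remaining layers act only within this group: relabel indices and
        # shift the layer number accordingly.
        def sub_combine(a, b, k, q, _p=prefix):
            return combine(a, b, k + logx, _p + q * x)
        sub = eval_fft_graph(sub, sub_combine)
        for i, v in zip(idxs, sub):
            out[i] = v
    return out

# Check that the two evaluations agree on a small instance (stand-in combine).
vals = list(range(8))
comb = lambda a, b, k, i: a + 2 * b
assert eval_fft_graph(vals, comb) == eval_fft_graph_split(vals, comb, 2)
```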
6.4. Sorting
Our purpose is to show that, within constant multipliers, asymptotically
optimal parallel computation, communication and synchronization costs can
be achieved by a comparison sorting algorithm that makes (1 + o(1))n log2 n
comparisons altogether. Thus for the full range of parameters, optimality in
our sense, that is for large enough mi , can be achieved. No doubt smaller
multipliers for the various costs can be obtained by algorithms that specialize
in different ranges of the parameters, but a full analysis remains a challenging
problem.
Sorting by deterministic oversampling and splitting into smaller subsets
of about equal size is known to be achievable using the following idea [13,
30, 37]:
Lemma 6.1 For numbers $N, S, G$ and $t$ one can find a set of $S$ splitters in any ordered set $X$ of $N$ elements such that in the ordering on $X$ the number of elements between two successive splitters is $N/S \pm 2tG$, by using the following procedure: Partition $X$ into $G$ sets of $N/G$ elements each, sort each such set, pick out every $t$-th element from each such sorted list, sort the resulting $N/t$ elements, and finally pick every $(N/(tS))$-th element of that.
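The procedure of Lemma 6.1 translates directly into code; the sketch below is our own illustration, with a hypothetical function name, and assumes the divisibility conditions needed for the slicing to be exact.

```python
import random

def select_splitters(X, S, G, t):
    """Pick S splitters from the N elements of X by the procedure of
    Lemma 6.1: partition X into G groups of N/G elements, sort each group,
    take every t-th element of each sorted group, sort those N/t samples,
    and take every (N/(t*S))-th of the result.  Assumes G | N, t | N/G and
    t*S | N.  Illustrative sketch only."""
    N = len(X)
    group_size = N // G
    samples = []
    for a in range(0, N, group_size):
        group = sorted(X[a:a + group_size])
        samples.extend(group[t - 1::t])        # every t-th element
    samples.sort()                             # N/t samples in all
    step = N // (t * S)
    return samples[step - 1::step][:S]         # every (N/(t*S))-th element

# Hypothetical usage: 4096 keys, S = 8 splitters, G = 16 groups, t = 4.
X = [random.random() for _ in range(4096)]
print(len(select_splitters(X, S=8, G=16, t=4)))   # 8
```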
splitters that already split $Y$ into $x$ sublists of about equal size. Our recursive step will divide the set $Y$ into $x m_{j-1}/t_j^2$ sublists of about equal size at the next stage, for $t_j = e^{\sqrt{\log_2 m_{j-1}}}$. This is achieved by applying the method of Lemma 6.1 to each of the $x$ sublists of $Y$ with $N = m_j/x$, $S = m_{j-1}/t_j^2$, and $G = N/m_{j-1}$.
Assuming first that the sublists are of exactly the same length we get the recurrence
$\mathrm{Sort\text{-}Comm}(m_j, x, j)$
$\lesssim (m_j/(m_{j-1} p_j))(m_{j-1} g_{j-1} + \mathrm{Sort\text{-}Comm}(m_{j-1}, 0, j-1))$
$+ \mathrm{Sort\text{-}Comm}(m_j, x m_{j-1}/t_j^2, j) + \mathrm{Sort\text{-}Comm}(m_j/t_j, 0, j)$.
First we observe that it follows from the definition of $t_j$ and condition (3) that for any $l > 0$, $t_j = \omega((\log_2 m_j)^l)$. It follows that the last term in the above recurrence can be ignored since it contributes a lower order term even if implemented by a sorting algorithm in which the communication cost is proportional to the computation cost, rather than a logarithmic factor cheaper as is the overall goal. (Note also that the constant $k$ in (3) will only influence the threshold at which the $m$'s can be considered large.)
Omitting that last term leaves the same recurrence as for FFT-Comm but with the multiplier $m_{j-1}/t_j^2$ rather than $m_{j-1}$ in the second term. Since $\log_2(m_{j-1}/t_j^2) \geq (\log_2 m_{j-1})/2$ for large enough $m_{j-1}$ it will have the following solutions analogous to (20) and (21):
$\mathrm{Sort\text{-}Comm}(n, d) \lesssim_d \sum_{i=1}^{d-1} (n \log n)\, g_i/(Q_i \log m_i)$, and   (22)
$\mathrm{Sort\text{-}Synch}(n, d) \lesssim_d \sum_{i=1}^{d-1} (n \log n)\, L_{i+1}/(Q_i m_i \log m_i)$.   (23)
In fact, the sublists into which the splitters split at any stage are not of exactly equal length as the analysis above assumes, but are of lengths within multiplicative factors of $1 \pm 2/t_j$ of each other, or $1 \pm o((\log m_j)^{-l})$ for any constant $l$. This is because the mean of their lengths is $N/S = N t_j^2/m_{j-1}$ while the maximum deviation from this length is $2 t_j G = 2N t_j/m_{j-1}$, which is a fraction $2/t_j$ of that mean. It can be verified that accommodating these variations in the subset lengths does not change the conclusion.
7. Acknowledgements
I am grateful to Alex Gerbessiotis and Alex Tiskin for their comments on
this paper and to Guy Blelloch, Phil Gibbons, Mark Hill, Vijaya Ramachan-
dran and John Savage for conversations.
References
[1] A. Aggarwal, B. Alpern, A. Chandra, and M. Snir. A model for hierar-
chical memory. Proc. of the 19th ACM Symp. on Theory of Computing,
pages 305–314, 1987.
[10] O. Bonorden et al. The Paderborn University BSP (PUB) Library, Par-
allel Computing 29:2 (2003) 187-207.
[20] M. W. Goudreau, K. Lang, G. Narlikar, S. B. Rao, BOS is Boss: A Case
for Bulk-Synchronous Object Systems, Proceedings of the 11th Annual
ACM Symposium on Parallel Algorithms and Architectures (SPAA 99),
pp 115-125.
[21] J.M.D. Hill et al., BSPLib: The BSP programming library, Parallel
Computing, 24:14 (1998) 1947-1980.
[22] M. D. Hill and M. R. Marty, Amdahl’s Law in the Multicore Era, IEEE
Computer, July 2008.
[23] J. Hong and H. Kung. I/O-complexity: The red-blue pebble game. Pro-
ceedings of ACM Symposium on Theory of Computing, pp 326-333, 1981.
[31] V. Ramachandran. Note: Dynamic programming and hierarchical divide
and conquer for hierarchical multi-level memories, Personal communi-
cation. July 2008.
[32] J. Sampson, R. Gonzalez, J.-F. Collard, N. P. Jouppi, M. S. Schlansker:
Fast synchronization for chip multiprocessors. SIGARCH Computer Ar-
chitecture News 33(4): 64-69 (2005)
[33] J. E. Savage, Extending the Hong-Kung Model to Memory Hierarchies.
In Computing and Combinatorics, D.-Z. Du and M. Li, Eds. Springer-
Verlag, 1995, pp. 270-281.
[34] J. E. Savage, Models of Computation, Addison Wesley, Reading, MA
(1998).
[35] J.E. Savage and M. Zubair. A unified model of multicore architectures. Proc. First Int. Forum on Next-Generation Multicore/Manycore Technologies, Cairo, Egypt, Nov 24-25, 2008.
[36] S. Sen, S. Chatterjee and N. Dumir, Towards a Theory of Cache-Efficient
Algorithms, J.ACM, 49:6 (2002) 828-858.
[37] H. Shi and J. Schaeffer. Parallel sorting by regular sampling. J. of Par-
allel and Distributed Computing, 14:362-372, 1992.
[38] A. Tiskin. The bulk-synchronous parallel random access machine. The-
oretical Computer Science, 196, 1–2, pp. 109–130, 1998.
[39] L. G. Valiant. A Bridging Model for Parallel Computation. Communi-
cations of the ACM, 33(8), pp. 103-111, August 1990.
[40] U. Vishkin, S. Dascal, E. Berkovich, J. Nuzman, Explicit Multi-
Threading (XMT) bridging models for instruction parallelism, Proc.
1998 ACM Symposium on Parallel Algorithms and Architectures
(SPAA), 140-151.
[41] J. S. Vitter, External Memory Algorithms and Data Structures: Dealing with Massive Data, ACM Computing Surveys, 33(2), June 2001, 209-271.
[42] J.S. Vitter and E.A.M. Shriver. Algorithms for parallel memory II: Hi-
erarchical multilevel memories. Algorithmica, 12(2/3):148-169, 1994.