Paper Dvi

Download as pdf or txt
Download as pdf or txt
You are on page 1of 7

Sorting, Searching, and Simulation in the MapReduce

Framework

Michael T. Goodrich1, Nodari Sitchinava2 , and Qin Zhang2


1
Department of Computer Science, University of California, Irvine, USA
[email protected]
2
MADALGO⋆ , Department of Computer Science, University of Aarhus, Denmark
{nodari, qinzhang}@madalgo.au.dk

Abstract. In this paper, we study the MapReduce framework from an algorith-


mic standpoint and demonstrate the usefulness of our approach by designing and
analyzing efficient MapReduce algorithms for fundamental sorting, searching,
and simulation problems. This study is motivated by a goal of ultimately putting
the MapReduce framework on an equal theoretical footing with the well-known
PRAM and BSP parallel models, which would benefit both the theory and prac-
tice of MapReduce algorithms. Our PRAM and BSP simulation results imply
efficient MapReduce solutions for many applications, such as sorting, 2- and 3-
dimensional convex hulls, fixed-dimensional linear programming. All our algo-
rithms take a constant number of rounds under the commonly made assumptions
for the hardware running MapReduce.

1 Introduction

The MapReduce framework [5, 6] is a programming paradigm for designing parallel and
distributed algorithms. It provides a simple programming interface that is specifically
designed to make it easy for a programmer to design a parallel program that can effi-
ciently perform a data-intensive computation. Moreover, it is a framework that allows
for parallel programs to be directly translated into computations for cloud computing
environments and server clusters (e.g., see [16]). This framework is gaining wide-spread
interest in systems domains, in that this framework is being used in Google data cen-
ters and as a part of the open-source Hadoop system [19] for server clusters, which have
been deployed by a wide variety of enterprises3 , including Yahoo!, IBM, The New York
Times, eHarmony, Facebook, and Twitter.
Building on pioneering work by Feldman et al. [9] and Karloff et al. [14], our
interest in this paper is in studying the MapReduce framework from an algorithmic
standpoint, by designing and analyzing MapReduce algorithms for fundamental sort-
ing, searching, and simulation problems. Such a study could be a step toward ultimately
putting the MapReduce framework on an equal theoretical footing with the well-known
PRAM and BSP parallel models.

MADALGO is the Center for Massive Data Algorithmics, a center of the Danish National
Research Foundation.
3
See http://en.wikipedia.org/wiki/Hadoop.
Still, we would be remiss if we did not mention that this framework is not without
its detractors. DeWitt and Stonebraker [7] mention several issues they feel are short-
comings of the MapReduce framework, including that it seems to require brute-force
enumeration instead of indexing for performing searches. Naturally, we feel that this
criticism is a bit harsh, as the theoretical limits of the MapReduce framework have yet
to be fully explored; hence, we feel that further theoretical study is warranted. Indeed,
this paper can be viewed as at least a partial refutation of the claim that the MapRe-
duce framework disallows indexed searching, in that we show how to perform fast and
efficient multi-search in the MapReduce framework.

The MapReduce Framework. In the MapReduce framework, a computation is specified


as a sequence of map, shuffle, and reduce steps that operate on a set X = {x1 , x2 , . . . , xn }
of values:

– A map step applies a function, µ, to each value, xi , to produce a finite set of key-
value pairs (k, v). To allow for parallel execution, the computation of the function
µ(xi ) must depend only on xi .
– A shuffle step collects all the key-value pairs produced in the previous map step,
and produces a set of lists, Lk = (k; v1 , v2 , . . .), where each such list consists of
all the values, vj , such that kj = k for a key k assigned in the map step.
– A reduce step applies a function, ρ, to each list Lk = (k; v1 , v2 , . . .), formed in
the shuffle step, to produce a set of values, y1 , y2 , . . . . The reduction function, ρ, is
allowed to be defined sequentially on Lk , but should be independent of other lists
Lk′ where k ′ 6= k.

The parallelism of the MapReduce framework comes from the fact that each map or
reduce operation can be executed on a separate processor independently of others. Thus,
the user simply defines the functions µ and ρ, and the system automatically schedules
map-shuffle-reduce steps and routes data to available processors, including provisions
for fault tolerance.
The outputs from a reduce step can, in general, be used as inputs to another round
of map-shuffle-reduce steps. Thus, a typical MapReduce computation is described as a
sequence of map-shuffle-reduce steps that perform a desired action in a series of rounds
that produce the algorithm’s output after the last reduce step.

Evaluating MapReduce Algorithms. Ideally, we desire the number of rounds in a MapRe-


duce algorithm to be a constant. For example, consider an often-cited MapReduce al-
gorithm to count all the instances of words in a document. Given a document, D, we
define the set of input values X to be all the words in the document and we then pro-
ceed as follows: the Map step, for each word, w, in the document, maps w to (w, 1).
Then in the Shuffle step, collects all the (w, 1) pairs for each word, producing a list
(w; 1, 1, . . . , 1), with the number of 1’s in each such list equal to the number of times w
appears in the document. Finally, the Reduce step, scans each list (w; 1, 1, . . . , 1), sum-
ming up the number of 1’s in each such list, and outputs pairs (w, nw ) as a final value,
where nw is the number of 1’s in the list for each w. This single-round computation
clearly computes the number of times each word appears in D.
The number of rounds in a MapReduce algorithm is not always equal to 1, however,
and there are, in fact, several metrics that one can use to measure the efficiency of a
MapReduce algorithm over the course of its execution, including the following:

– We can consider R, the number of rounds of map-shuffle-reduce that the algorithm


uses.
– If we let nr,1 , nr,2 , . . . denote the mapper and reducer I/O sizes for round r, so that
nr,i is the size of the inputs and outputs for mapper/reducer i in round r, then we
can define Cr , the communication complexity of round r, to be the total size of
P inputs and outputs for all the mappers
the PR−1 and reducers in round r, that is, Cr =
i nr,i . We can also define C = r=0 Cr – the communication complexity for
the entire algorithm.
– We can let tr denote the internal running time for round r, which is the maximum
internal running time taken by a mapper or reducer in round r, where we assume
tr ≥ maxi {nr,i }, since a mapper or reducer must have a running time that is at
least the size
PR−1 of its inputs and outputs. We can also define total internal running
time, t = r=0 tr , for the entire algorithm, as well.

We can make a crude calibration of a MapReduce algorithm using the following addi-
tional parameters:

– L: the latency L of the shuffle network, which is the number of steps that a mapper
or reducer has to wait until it receives its first input in a given round.
– B: the bandwidth of the shuffle network, which is the number of elements in a
MapReduce computation that can be delivered by the shuffle network in any time
unit.

Given these parameters, a lower bound for the total running time, T , of an imple-
mentation of a MapReduce algorithm can be characterized as follows:

R−1
!
X
T =Ω (tr + L + Cr /B) = Ω(t + RL + C/B).
r=0

For example, given a document D of n words, the simple word-counting MapReduce al-
gorithm given above has a worst-case performance of R = 1, C = Θ(n), and t = Θ(n);
hence, its worst-case time performance T = Θ(n), which is no faster than sequential
computation. Unfortunately, such performance could be quite common in the natural-
language documents. For instance, in the Brown Corpus [15], the word “the” accounts
for 7% of all word occurrences.
Note, therefore, that focusing exclusively on R, the number of rounds in a MapRe-
duce algorithm, can actually lead to an inefficient algorithm. For example, if we focus
only on the number of rounds, R, then the most efficient algorithm would always be the
trivial one-round algorithm, which maps all the inputs to a single key and then has the
reducer for this key perform a standard sequential algorithm to solve the problem. This
approach would run in one round, but it would not use any parallelism; hence, it would
be relatively slow compared to an algorithm that was more “parallel.”
Memory-Bound and I/O-Bound MapReduce Algorithms. So as to steer algorithm de-
signers away from the trivial one-round algorithm, recent algorithmic formalizations of
the MapReduce paradigm have focused primarily on optimizing the round complexity
bound, R, while restricting the memory size or input/output size for reducers. Karloff
et al. [14] define their MapReduce model, MRC, so that each reducer’s I/O size is re-
stricted to be O(N 1−ǫ ) for some small constant ǫ > 0, and Feldman et al. [9] define
their model, MUD, so that reducer memory size is restricted to be O(logc N ), for some
constant c ≥ 0, and reducers are further required to process their inputs in a single pass.
These restrictions limit the feasibility of the trivial one-round algorithm for solving a
problem in the MapReduce framework and instead compel algorithm designers to make
better utilization of parallelism.
In this paper, we follow the I/O-bound approach, as it seems to correspond better
to the way reducer computations are specified, but we take a somewhat more general
characterization than Karloff et al. [14], in that we do not bound the I/O size for reducers
explicitly to be O(N 1−ǫ ), but instead allow it to be an arbitrary parameter:

– We define M to be an upper bound on the I/O-buffer memory size for all reducers
used in a given MapReduce algorithm. That is, we predefine M to be a parameter
and require that ∀r, i : nr,i ≤ M.

We then can use M in the design and/or analysis of each of our MapReduce algorithms.
For instance, if each round of an algorithm has a reducer with an I/O size of at most M ,
then we say that this algorithm is an I/O-memory-bound MapReduce algorithm with
parameter M . In addition, if each round has a reducer with an I/O size proportional to
M (whose processing probably dominates the reducer’s internal running time), then we
can give a simplified lower bound on the time, T , for such an algorithm as

T = Ω(R(M + L) + C/B).

This approach therefore can characterize the limits of parallelism that are possi-
ble in a MapReduce algorithm and it also shows that we should concentrate on the
round complexity and communication complexity of a MapReduce algorithm in char-
acterizing its performance.4 Of course, such bounds for R and C may depend on M ,
but that is fine, for similar characterizations are common in the literature on external-
memory algorithms (e.g., see [1, 3, 4, 18]). In the rest of the paper, when we talk about
the MapReduce model, we always mean the I/O-memory-bound MapReduce model.

Our Contributions. In Section 2 we present a BSP-like computational framework which


we prove to be equivalent to the I/O-memory-bound MapReduce model. This formula-
tion is more familiar in the distributed algorithms community, making the design and
analysis of algorithms more intuitive. The new formulation allows a simple simulation
result of the BSP algorithms in the MapReduce model with no slowdown in the number
of rounds, resulting in straightforward MapReduce implementations of a large number
of existing algorithms for BSP model and its variants.
4
These measures correspond naturally with the time and work bounds used to characterize
PRAM algorithms (e.g., see [12]).
In Section 3 we present simulation of CRCW PRAM algorithms in our general-
ized MapReduce model, extending the EREW PRAM simulation results of Karloff et
al. [14]5 (which also holds in our generalized model). The only prior known simulation
of CRCW PRAM algorithm on MapReduce was via the standard CRCW-to-EREW
simulation (which incurs O(log2 P ) factor slowdown for a P -processor PRAM al-
gorithm) and then applying the EREW simulation of Karloff et al. [14]. In contrast,
our simulation achieves only Θ(logM P ) slowdown in the round complexity, which is
asymptotically optimal for a generic simulation.
Our CRCW PRAM simulation results achieve their efficiency through the use of
an implicit data structure we call invisible funnel trees. It can be viewed as placing
virtual multi-way trees rooted at the input items, which funnel concurrent read and
write requests to the data items, but are never explicitly constructed.
Our simulation results immediately imply solutions with O(logM N ) round and
O(N logM N ) communication complexities to problems of finding convex hull and
solving fixed-dimensional linear programming.
For problems with no known constant time CRCW PRAM solutions we show that
we can design efficient algorithms directly in our generic MapReduce framework. Specif-
ically, in Section 4 using the idea of invisible funnel trees we develop solutions to the
fundamental problems of prefix sums and randomized indexing of the input.
Finally, what is perhaps most unusual about the MapReduce framework is that there
is no explicit notion of “place” for where data is stored nor for where computations
are performed. This property is perhaps what led DeWitt and Stonebraker [7] to say
that it does not support indexed searches. Nevertheless, in Section 5 we show that the
MapReduce framework does in fact support efficient multi-searching – the problem of
searching for a number of keys in a search tree. Our solution builds a low congestion
search structure similar to [10]. However, to keep the communication complexity low,
our structure is smaller, forcing us to process the queries in smaller batches, which we
pipeline to maintain the optimal round complexity.
For ease of exposition let λ = logM N . All our algorithms exhibit O(λ) round and
O(λN ) communication complexities. Note, that in practice it is reasonable to assume
that M = Ω(N ǫ ) for some constant ǫ > 0, resulting in λ = O(1), i.e. constant round
and linear communication complexities for all our algorithms.

2 Generic MapReduce Computations


In this section we define a BSP-like computational model that captures the MapReduce
framework.
Consider a set of computing nodes V . Let Av (r) be a set of items associated with
each node v ∈ V in round r. Av (r) defines the state of v. Let f be a sequential function
defined for all nodes. Function f takes as input the state Av (r) of a node v and returns
a new set Bv (r), in the process destroying Av (r). Each item of Bv (r) is of the form
5
Their original proof was identified for the CREW PRAM model, but there was a flaw in that
version, which could violate the I/O-buffer-memory size constraint during a CREW PRAM
simulation. Based on a personal communication, we have learned that the subsequent version
of their paper will identify their proof as being for the EREW PRAM.
(w, a), where w ∈ V and a is a new item. We define the following computation which
proceeds in R rounds.
At the beginning of the computation only the input nodes v have non-empty states
Av (0). The state of an input node consists of a single input item.
In round r, each node v with non-empty state Av (r) 6= ∅ performs the following.
First, v applies function f on Av (r). This results in the new set Bv (r) and deletion of
Av (r). Then, for each element b = (w, a) ∈ Bv (r), node v sends item a to node w.
Note that if w = v, then v sends a back to itself. As a result of this process, each node
may receive a set of items from others. Finally, the set of received items at each node v
defines the new state Av (r + 1) for the next round. The items comprising the non-empty
states Av (r) after R rounds define the outputs of the entire computation at which point
the computation halts.
The number of rounds R denotes the round complexity of the computation. The to-
tal number of all the items sent (or, equivalently, received) by the nodes Pin each round
r defines the communication complexity Cr of round r, that is, Cr = v |Bv (r)|. Fi-
nally, PR−1 P complexity C of the entire computation is defined as C =
PR−1the communication
r=0 C r = r=0 v |Bv (r)|. Note that this definition implies that nodes v whose
states Av (r) are empty at the beginning of round r do not contribute to the communi-
cation complexity. Thus, the set V of nodes can be infinite. But, as long as only a finite
number of nodes have non-empty Av (r) at the beginning of each round, the communi-
cation complexity of the computation is bounded.
Observe that during the computation, in order for node v to send items to node w
in round r, v should know the label of the destination w, which can be obtained by v
in the following possible ways (or any combination thereof): 1) the link (v, w) can be
encoded in f as a function of the label of v and round r, 2) some node might send the
label of w to v in the previous round, or 3) node v might keep the label of w as part of
its state by constantly sending it to itself.
Thus, the above computation can be viewed as a computation on a dynamic directed
graph G = (V, E), where an edge (v, w) ∈ E in round r represents a possible commu-
nication link between v and w during that round. The encoding of edges (v, w) as part
of function f is equivalent to defining an implicit graph [13]; keeping all edges within
a node throughout the computation is equivalent to defining a static graph. For ease of
exposition, we define the following primitive operations that can be used within f at
each node v:

– create an item; delete an item; modify an item; keep item x (that is, the item x will
be sent to v itself by creating an item (v, x) ∈ Bv (r)); send an item x to node w
(create an item (w, x) ∈ Bv (r)).
– create an edge; delete an edge. This is essentially the same as create an item and
delete an item, since explicit edges are just maintained as items at nodes. This
operations will simplify exposition when dealing with explicitly defined graphs G
on which computation is performed.

The following theorem shows that the above framework captures the essence of
computation in the MapReduce framework.6
6
Due to space constraints, all omitted proofs can be found in Appendix A.
Theorem 1. Let G = (V, E) and f be defined as above such that in each round each
node v ∈ V sends, keeps and receives at most M items. Then computation on G with
round complexity R and communication complexity C can be simulated in the I/O-
memory-bound MapReduce model with the same round and communication complexi-
ties.
The above theorem gives an abstract way of designing MapReduce algorithms.
More precisely, to design a MapReduce algorithm, we define graph G and a sequen-
tial function f to be performed at each node v ∈ V . This is akin to designing BSP
algorithms and is a more intuitive way than defining Map and Reduce functions.
Note that in the above framework we can easily implement a global loop primitive
spanning over multiple rounds: each item maintains a counter that is updated at each
round. We can also implement parallel tail recursion by defining the labels of nodes to
include the recursive call stack identifiers.

3 Simulation Results
BSP simulation. The reader may observe that the generic MapReduce model of the pre-
vious section is very similar to the BSP model of Valiant [17], leading to the following
conclusion.
Theorem 2. Given a BSP algorithm A that runs in R super-steps with a total memory
size N using P ≤ N processors, we can simulate A using R rounds and C = O(RN )
communication in the I/O-memory-bound MapReduce framework with reducer memory
size bounded by M = ⌈N/P ⌉.

CRCW PRAM simulation. In this section we present a simulation of f -CRCW PRAM


model, the strongest variant of the PRAM model, where concurrent writes to the same
memory location are resolved by applying a commutative semigroup operator f on all
values being written to the same memory address, such as Sum, Min, Max, etc.
The input to the simulation of a PRAM algorithm A is specified by an indexed set of
P processor items, p1 , . . . , pP , and an indexed set of initialized PRAM memory cells,
m1 , . . . , mN , where N is the total memory size used by A. For ease of exposition we
assume that that P = N O(1) , i.e. logM P = O(logM N ) = O(λ).
The main challenge in simulating the algorithm A in the MapReduce model is that
there may be as many as P reads and writes to the same memory cell in any given step
and P can be significantly larger than M , the memory size of reducers. Thus, we need
to have a way to “fan in” these reads and writes. We accomplish this by using invisible
funnel trees, where we imagine that there is a different implicit O(M )-ary tree rooted at
each memory cell that has the set of processors as its leaves. Intuitively, our simulation
algorithm involves routing reads and writes up and down these N trees. We view them
as “invisible”, because we do not actually maintain them explicitly, since that would
require Θ(P N ) additional memory cells.
Each invisible funnel tree is an undirected7 rooted tree T with branching factor
d = M/2 and height L = ⌈logd P ⌉ = O(λ). The root of the tree is defined to be
7
Each undirected edge is represented by two directed edges.

You might also like