Early Experiences in Implementing the Buffer Tree¹
David Hutchinson
Anil Maheshwari
Jörg-Rüdiger Sack
Radu Velicescu
School of Computer Science
Carleton University
Ottawa, Ont., Canada K1S 5B6
e-mail: {hutchins,maheshwa,sack,[email protected]
ABSTRACT
Computer processing speeds are increasing rapidly due to the evolution of faster chips, parallel processing of data, and more efficient software. Users today have access to an unprecedented amount of high quality, high resolution data through various technologies. This is resulting in a growing demand for higher performance input and output mechanisms in order to pass huge data sets from the external memory (EM), or disk system, through the relatively small main memory of the computer and back again. In recent years, research into external memory algorithms has been growing to keep pace with the demand for innovation in this area.
EM algorithms for individual problems have been developed, but few general purpose EM tools have been designed. A fundamental tool is the buffer tree, an external version of the (a,b)-tree. It can be used to satisfy a number of EM requirements, such as sorting, priority queues, and range searching, in a straightforward and I/O-optimal manner.
In this paper we describe an implementation of a buffer tree. We describe benchmarking tests which lead to an experimental determination of certain parameter values different from those originally suggested in the design of the data structure. We describe implementations of two algorithms based on the buffer tree: an external memory treesort, and an external memory priority queue. Our initial experiments with buffer tree sort for large problem sizes indicate that this algorithm easily outperforms similar algorithms based on internal memory techniques. With some tuning of the buffer tree parameters we are able to obtain performance consistent with theoretical predictions for the range of problem sizes tested. We include comparisons with TPIE Merge Sort.
We conclude that (a) the buffer tree as a generic data structure appears to perform well in theory and practice, and (b) measuring I/O efficiency experimentally is an important topic that merits further attention.
1. Introduction
The Input/Output bandwidth between fast internal memory and slower secondary storage is the bottleneck in many large-scale applications, such as multimedia, GIS, land information systems, seismic databases, satellite imagery, digital libraries, real-time applications and virtual reality. A typical disk drive is a factor of 10^5 to 10^6 slower in performing a random access than is the main memory of a computer system. Present methodologies for addressing the performance issues involving secondary storage can be classified as follows [14]:
¹ Research supported by the Natural Sciences and Engineering Research Council of Canada.
- increasing storage device parallelism, which improves the bandwidth between secondary memory and main memory,
- exploiting locality of reference via organization of the data and processing sequence,
- overlapping I/O with computation, e.g. using prefetching.
Some of the earliest work in external memory algorithms was done by Floyd [13] and Hong and Kung [16], who studied matrix operations and fast Fourier transforms. Lower bounds for a number of problems related to sorting were presented by Aggarwal and Vitter [1]. The classical I/O model was introduced by Vitter and Shriver [28]. The uniprocessor, single disk version of this model represents an EM computer system as a processor, some fixed amount of internal memory, and a disk. It is described by the following parameters:
- N is the number of elements in the problem instance,
- M is the number of elements that can fit in the internal memory,
- B is the number of elements per block,

where M < N and 1 ≤ B ≤ M/2.
An Input/Output operation (I/O) is the process of reading or writing a block of B contiguous data elements to or from the disk. The I/O complexity of an algorithm is defined as the total number of I/Os that the algorithm performs. It is assumed in this model that internal computation is free. Several other I/O models have been proposed; see e.g. [28, 12, 20]. The theoretical framework of the algorithms in this paper is based on the Parallel Disk I/O Model (PDM) proposed by Vitter and Shriver [28].
Permutations and sorting have been very widely studied in the context of this model; see [2, 1, 9, 20, 27]. Algorithms for problems in computational geometry [15, 4, 3, 8, 23], graph theory [7, 19, 3], and GIS [4] have been presented. A number of general paradigms for designing external memory algorithms have been proposed. These include simulation [7, 12, 15, 22], merging [5, 20], distribution [21, 15, 7], and data structuring [2].
Recently there has been an increasing interest in implementation and experimental research work targeted to I/O efficient computation. Research work in this area includes:
(i) the TPIE (Transparent Parallel I/O Environment) project of Vengroff and Vitter [26, 24], which aims to collect implementations of existing algorithms within a common framework, and to make development of new implementations easier;
(ii) experiments by Chiang [6] with four algorithms for the orthogonal segment intersection problem;
(iii) Cormen et al. [11, 10], who have reported on a number of implementation issues and results relating to I/O efficient algorithms, including FFT computations using parallel processors, and FFT, permutations, and sorting using the Parallel Disk Model.
Motivated by the goal of constructing I/O efficient versions of commonly used internal memory data structures, Arge [2, 3] proposed the data structuring paradigm, and in particular the Buffer Tree. A buffer tree is an external memory search tree. It supports operations such as insert, delete, search, and deletemin, and it enables the transformation of a class of internal-memory algorithms into external memory algorithms by exchanging the data structures used. A large number of external memory algorithms have been proposed [3, 2] using the buffer tree data structure, including sorting, priority queues, range trees, segment trees, and time forward processing. These in turn are subroutines for many external memory graph algorithms, such as expression tree evaluation, centroid decomposition, least common ancestor, minimum spanning trees, and ear decomposition. There are a number of major advantages to the buffer tree approach. It applies to a large class of problems whose solutions use search trees as the underlying data structure. This enables the use of many normal internal memory algorithms, and "hides" the I/O specific parts of the technique in the data structures. Several techniques based on the buffer tree, e.g. time forward processing [3], are simpler than competitive EM techniques, and are of the same I/O complexity, or better, with respect to their counterparts.
Figure 1: The buffer tree. (Diagram labels: a root and internal nodes, each with a buffer; height O(log_m n); fanout up to m; leaf nodes; leaves of one disk block each.)
In this paper, we describe the issues arising out of our implementation of the buffer tree. We present an implementation of the buffer tree, and show the flexibility and generality of the structure by implementing EM sorting and an EM priority queue. To test the efficiency of our implementation we used sorting as an example. For data sets larger than the available main memory, our implementation of buffer tree sort outperforms internal memory sort (e.g. qsort) by a large and increasing margin. We use an EM merge sort algorithm from TPIE to provide comparative performance results for the larger problem sizes. We observe that certain parameters suggested in [2, 3] may not provide the best results in practice. By tuning these parameters we obtained improved results while maintaining the same asymptotic worst case I/O complexity. The buffer tree is a conceptually simple data structure, and it turned out that implementation of these applications based on the buffer tree was straightforward. We can therefore support the claim that the buffer tree is a generic EM data structure that performs well in theory and practice.
While the buffer tree gives us an I/O-optimal sort, our timing studies of the implementation indicate that its performance is sensitive to some non-linearities in the environment or algorithm. Experimental results show that these non-linearities are reduced by an optimal choice of parameters.
2. The Buffer Tree
In this section we describe the buffer tree data structure of Arge [2, 3], with two update operations, namely insert and delete. Subsequently, we discuss how we can perform sorting and maintain a priority queue using a buffer tree.
Let N be the total number of update operations, M the size of the internal memory, and B the block size, and set m = M/B and n = N/B. The buffer tree is an (a,b)-tree [17], where a = m/4 and b = m, augmented with a buffer of size Θ(m) blocks in each node. Each node (with the exception of the root) has a fanout (number of children) between m/4 and m. Each node also contains partitioning elements, or "splitters", which delimit the range of keys that will be routed to each child. The number of splitters is one less than the fanout of the node. The height of the buffer tree is O(log_m n) (see Figure 1). Since the buffer tree is an extension of the (a,b)-tree, the computational complexity analyses of the various (a,b)-tree operations still apply. The buffers are used to defer operations, allowing their execution in a "lazy" manner and thus achieving the necessary blocking for performing operations efficiently in external memory. A buffer is full if it has more than m/2 blocks.
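To make the node structure concrete, the following C++ sketch shows one plausible layout; the names (BufferTreeNode, BlockID) and the exact fields are our illustrative choices, not code from [2, 3].

    // Illustrative sketch of a buffer tree node; the invariants follow the
    // text above (fanout in [m/4, m], buffer "full" above m/2 blocks).
    #include <cstddef>
    #include <vector>

    typedef int Key;
    typedef long BlockID;  // identifies a block on disk (assumed representation)

    struct BufferTreeNode {
        std::vector<Key>     splitters;  // fanout - 1 partitioning elements
        std::vector<BlockID> children;   // between m/4 and m children (non-root)
        std::vector<BlockID> buffer;     // deferred request blocks on disk

        // A buffer is full if it holds more than m/2 blocks.
        bool bufferFull(std::size_t m) const { return buffer.size() > m / 2; }

        // Index of the child whose key range contains k.
        std::size_t route(Key k) const {
            std::size_t i = 0;
            while (i < splitters.size() && k >= splitters[i]) ++i;
            return i;
        }
    };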
For any update operation, a request element is created, consisting of the record to be inserted or deleted, a flag denoting the type of the operation, and an automatically generated time stamp. Such request elements are collected in the internal memory until a block of B requests has been formed. The request elements, as a block, are inserted into the buffer of the root using one I/O. If the buffer of the root contains fewer than m/2 blocks, there is nothing else to be done in this step. Otherwise the buffer is emptied by a buffer-emptying process.
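The update path just described can be sketched as follows; writeBlockToRootBuffer, rootBufferBlocks and emptyRootBuffer stand in for the real I/O routines and are assumptions of this sketch.

    // Sketch of an update: requests are staged in internal memory until a
    // block of B has formed, then appended to the root buffer with one I/O.
    #include <cstddef>
    #include <vector>

    enum OpType { OP_INSERT, OP_DELETE };

    struct Request {        // record + operation flag + automatic time stamp
        int    key;
        int    data;
        OpType op;
        long   timestamp;
    };

    class BufferTree {
        std::vector<Request> staging_;  // in-memory block being assembled
        std::size_t B_, m_;
        long clock_;
    public:
        BufferTree(std::size_t B, std::size_t m) : B_(B), m_(m), clock_(0) {}

        void update(int key, int data, OpType op) {
            Request r = { key, data, op, clock_++ };
            staging_.push_back(r);
            if (staging_.size() == B_) {            // a full block of requests
                writeBlockToRootBuffer(staging_);   // one I/O
                staging_.clear();
                if (rootBufferBlocks() > m_ / 2)    // root buffer now full
                    emptyRootBuffer();              // buffer-emptying process
            }
        }
    private:
        // Placeholders for the disk-resident parts of the structure.
        void writeBlockToRootBuffer(const std::vector<Request>&) {}
        std::size_t rootBufferBlocks() const { return 0; }
        void emptyRootBuffer() {}
    };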
The buffer-emptying process at an internal node requires O(m) I/Os, since we load m/2 blocks into the internal memory and distribute the elements among the Θ(m) children of that node. A buffer-emptying process at a leaf may require rebalancing the underlying (a,b)-tree. An (a,b)-tree is rebalanced by performing a series of "split" operations in the case of an insertion, or a series of "fuse" and "share" operations in the case of a deletion [17]. Before performing a rebalance operation, we ensure that the buffers of the corresponding nodes are empty. This is achieved by first performing the buffer-emptying process at the node involved. The deletion of a block may involve the initiation of several buffer-emptying processes. By using dummy blocks during the deletion process, a buffer-emptying process can be protected from interference by other processes. (See [2] for details.)
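Continuing the sketches above, the internal-node buffer-emptying process amounts to one scan-and-distribute pass; readBuffer, appendToChildBuffer and clearBuffer are assumed placeholder I/O routines.

    // Sketch: load the (at most m/2, temporarily more) buffered blocks, route
    // each request by the splitters, and push blocks down to the children.
    std::vector<Request> readBuffer(BufferTreeNode&);          // O(m) I/Os in total
    void appendToChildBuffer(BlockID child,
                             const std::vector<Request>&);     // one block push
    void clearBuffer(BufferTreeNode&);

    void emptyInternalBuffer(BufferTreeNode& node) {
        std::vector<Request> reqs = readBuffer(node);
        // Matching insert/delete pairs could be cancelled here using timestamps.
        std::vector<std::vector<Request> > perChild(node.children.size());
        for (std::size_t i = 0; i < reqs.size(); ++i)
            perChild[node.route(reqs[i].key)].push_back(reqs[i]);
        for (std::size_t c = 0; c < perChild.size(); ++c)
            if (!perChild[c].empty())
                appendToChildBuffer(node.children[c], perChild[c]);
        clearBuffer(node);
        // Children whose buffers are now full are emptied recursively.
    }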
The analysis (i.e., the I/O complexity) of operations on the buffer tree is obtained by adapting the amortization arguments for (a,b)-trees [17]. Each update element, on insertion into the root buffer, is given O((log_m n)/B) credits. Each block in the buffer of a node v holds O(height of the subtree rooted at v) credits. For an internal node, its buffer is emptied only if it gets full, and this requires O(m) I/Os. Therefore, ignoring the cost of rebalancing, the total cost of all buffer emptying on internal nodes is bounded by O(n log_m n) I/Os. The total number of rebalance operations required in an (a,b)-tree with b > 2a, over K update operations on an initially empty (a,b)-tree, is bounded by K/(b/2 − a). Therefore, for N update operations on an (m/4, m)-tree, the total number of rebalance operations is bounded by O(n/m). Moreover, each rebalance operation may require a buffer-emptying process as well as updating the partitioning elements, and therefore may require up to O(m) I/Os. Thus the total cost of rebalancing is O(n) I/Os. The cost of emptying leaf nodes is bounded by the sorting operation. We summarize.
Theorem 1 (Arge [2, 3]). The total cost of an arbitrary sequence of N intermixed insert and delete operations on an initially empty buffer tree is O(n log_m n) I/O operations.
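For concreteness, the rebalancing bound can be assembled explicitly: the N element updates amount to K = n leaf-block updates of the underlying (m/4, m)-tree, so

\[
  \#\text{rebalances} \;\le\; \frac{K}{b/2 - a} \;=\; \frac{n}{m/2 - m/4} \;=\; \frac{4n}{m} \;=\; O\!\left(\frac{n}{m}\right),
\]
\[
  \text{rebalancing cost} \;\le\; O\!\left(\frac{n}{m}\right) \cdot O(m) \;=\; O(n)\ \text{I/Os}.
\]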
2.1. Sorting
A buffer tree can be used to sort N items as follows. First insert the N items into the buffer tree, followed by an empty/write operation. This is accomplished by performing a buffer-emptying process on every node, starting at the root, followed by reporting the elements in all the leaves in sorted order. This can be done within the complexity of computing the buffer tree data structure.
Corollary 1 (Arge [3]). N elements can be sorted in O(n log_m n) I/O operations using the buffer tree.
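As code, the sort is a thin driver over the tree operations. The interface below (BufferTreeSorter and its method names) is illustrative only, not the actual class in our implementation.

    // Sketch of buffer tree sort: N buffered inserts, one force-empty pass,
    // then a left-to-right scan of the leaves.
    #include <cstddef>
    #include <vector>

    class BufferTreeSorter {
    public:
        void insert(int key);                      // amortized O((log_m n)/B) I/Os
        void forceEmptyAllBuffers();               // empty every buffer, root downward
        std::vector<int> readLeavesLeftToRight();  // keys emerge in sorted order
    };

    std::vector<int> bufferTreeSort(BufferTreeSorter& tree,
                                    const std::vector<int>& input) {
        for (std::size_t i = 0; i < input.size(); ++i)
            tree.insert(input[i]);
        tree.forceEmptyAllBuffers();
        return tree.readLeavesLeftToRight();
    }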
The PDM compares competing EM algorithms according to the asymptotic number of I/O operations they require to solve a given problem of size N. By this model, the buffer tree sorting algorithm [3] is optimal, as the number of I/O operations matches the lower bound Ω(n log_m n) for the sorting problem [1].² In practice, however, other factors can also affect the running time. Cormen and Hirschl [10] observe that many PDM applications are not I/O bound, which suggests that CPU time is an important factor to be considered. A model which includes both I/O and CPU time is presented in [12].
2.2. Priority Queues
A dynamic search tree can be used as a priority queue since, in general, the leftmost leaf of the search tree contains the smallest element. We can use the buffer tree to maintain a priority queue in external memory by permitting the update operation described previously for insertion into the priority queue, and adding a deletemin operation. It is not necessarily true that the smallest element is in the leftmost leaf of the buffer tree, as it could be in the buffer of any node on the path from the root to the leftmost leaf. In order to extract the minimum element, i.e., execute the deletemin operation, a buffer-emptying process must first be performed on all nodes on the path from the root to the leftmost leaf. After the buffer emptying, the leftmost leaf contains the B smallest elements, and the children of the leftmost node in the buffer tree contain at least the (m/4)·B smallest elements. These elements can be kept in the internal memory, and at least (m/4)·B deletemins can be answered without doing any additional I/O. In order to obtain correct results for future deletemins, any new insertion/deletion must first be checked against these elements in the internal memory. This realization of a priority queue does not support the changing of priorities of elements already in the queue.
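A sketch of this deletemin logic, with the I/O steps left as labelled placeholders (the helper names are ours, not Arge's):

    // Sketch: cache the smallest keys in an internal min-heap; refill it by
    // emptying the buffers on the leftmost path when it runs dry.
    #include <functional>
    #include <queue>
    #include <vector>

    class EMPriorityQueue {
        std::priority_queue<int, std::vector<int>,
                            std::greater<int> > heap_;  // cached smallest keys

        void emptyBuffersOnLeftmostPath()   { /* buffer-emptying, root to leftmost leaf */ }
        void refillHeapFromLeftmostLeaves() { /* load at least (m/4)*B smallest keys */ }
    public:
        int deletemin() {
            if (heap_.empty()) {                  // cache exhausted: pay the I/O now
                emptyBuffersOnLeftmostPath();
                refillHeapFromLeftmostLeaves();   // then (m/4)*B deletemins are I/O-free
            }
            int smallest = heap_.top();
            heap_.pop();
            return smallest;
        }
        // New insertions/deletions must first be checked against the cached
        // range, as described in the text.
    };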
Theorem 2 (Arge [3]). The total cost of an arbitrary sequence of N insert, delete and deletemin operations on an initially empty buffer tree is O(n log_m n) I/O operations.
² For a single disk; n is the number of disk blocks in the problem, and m is the number of disk blocks that fit into the memory of size M.
3. Implementation Issues
3.1. Implementation of the Buffer Tree
3.1.1. The (a,b)-Tree
The buffer tree is an (a,b)-tree with buffers added to each tree node. One source of code for an (a,b)-tree is LEDA [18]. It quickly became clear, however, that this code was designed specifically for internal memory usage, as nodes were linked by many pointers to support a wide range of higher level operations. Converting the various pointers of the (a,b)-tree implementation to external memory representations turned out to be time consuming and ineffective for a data structure that was required to be I/O efficient. Too many I/O operations were required to update a single field in a child or parent of the node in memory to give attractive EM performance.
3.1.2. The Buffers
Each node of the buffer tree has an associated buffer, which may contain between 0 and m/2 blocks (between 0 and kM/2 bytes) of data³ which have not yet been inserted into the (a,b)-tree part of the buffer tree. The fanout of an internal node is at most m. Therefore, for a buffer tree consisting of ℓ levels, there may exist up to m^(ℓ−1) buffers, each kM/2 bytes in size.
³ This may increase temporarily during a buffer-emptying process.
Our implementation currently models each buffer as a Unix file. Due to restrictions on the number of Unix files that can be open at a time, each buffer file is closed after use and reopened when necessary. The time required by file open and close operations, as measured in our tests, was small. However, preliminary experiments suggest that this scheme may limit the effectiveness of asynchronous I/O, since the file close operation must wait for any outstanding I/O to complete. In this paper we report primarily on our experiences using synchronous I/O.
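The open-append-close discipline can be sketched with plain POSIX calls; the buffer-file naming scheme below is illustrative, not our actual one.

    // Sketch: one Unix file per buffer, reopened for each use so the process
    // stays under the open-file limit; close() waits for outstanding I/O.
    #include <cstdio>
    #include <fcntl.h>
    #include <unistd.h>

    void appendBlockToBuffer(long nodeId, const char* block, size_t blockBytes) {
        char path[64];
        std::sprintf(path, "buffer.%ld", nodeId);  // illustrative naming scheme
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) return;                        // error handling elided
        write(fd, block, blockBytes);              // one synchronous block write
        close(fd);                                 // file closed after every use
    }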
3.1.3. Compatibility, Usability and Accessibility
We wanted our implementation to be compatible with the use of LEDA [18] and with TPIE [26, 25]. The large number of algorithms and data structures available in LEDA forms an attractive context for implementation of internal memory algorithms, which are often components of a larger external memory system. For example, the external memory priority queue uses an internal memory priority queue as a component. The collection of efficient external memory techniques provided by TPIE forms an attractive workbench for building and testing new EM implementations. We expect to use TPIE services in a later version of our implementation, and perhaps offer it as an addition to the library. We chose C++ as our implementation language to preserve potential compatibility with these code libraries. In addition, we adopted the automated documentation tools from LEDA.
3.2. Implementation of an EM Sort Using the Buffer Tree
A buffer tree can be made to sort a data set simply by inserting the data into the tree and then force-emptying the buffers. Our implementation allows the leaves to be read sequentially from left to right, so the implementation of sorting was straightforward.
3.3. Implementation of an EM Priority Queue
The buffer tree can be modified to construct an I/O-optimal priority queue with insert and deletemin operations in EM [3]. Our implementation is obtained as follows:
- The leftmost leaf node of the buffer tree, together with its associated m/4 to m leaf blocks, is kept in internal memory instead of on disk. The (a,b)-tree nodes on the path from the root to the leftmost leaf are also kept in memory. For typical values of M, N, B, a, and b, these (a,b)-tree nodes make a negligible impact on internal memory consumption.
- The data records in memory are organized using an appropriate internal memory priority queue. In our initial implementation, this is a conventional heap, originally obtained from LEDA.
- Any requests that would normally be inserted into the root buffer of the buffer tree are first compared to the leftmost splitters of the (a,b)-tree nodes on the path from the root to the leftmost leaf. As argued above, this requires a small constant number of comparisons in practice. If a request would be routed to the cached data blocks, it is inserted directly into the internal heap. Otherwise, it is inserted into the buffer tree in the "normal" fashion (see the sketch after this list).
- The balance of the underlying (a,b)-tree is maintained in the normal fashion. If the leftmost leaf node underflows, sharing or fusing with siblings will occur. If it overflows, splitting may occur as it would in an (a,b)-tree.
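The routing check in the first item above might look like the following; leftmostSplitters holds the leftmost splitter of each node on the root-to-leftmost-leaf path (an assumed representation).

    // Sketch: a key belongs to the cached leftmost leaf iff it routes to the
    // leftmost child at every node on the path, i.e. it is smaller than each
    // node's first splitter. The path is short, so this is a small constant
    // number of comparisons.
    #include <cstddef>
    #include <vector>

    bool routedToCachedLeaf(const std::vector<int>& leftmostSplitters, int key) {
        for (std::size_t i = 0; i < leftmostSplitters.size(); ++i)
            if (key >= leftmostSplitters[i])
                return false;       // takes the normal buffer tree path
        return true;                // goes directly into the internal heap
    }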
3.4. Testing Platform and Parameters
We performed most of our development and experimental work on a network of sixteen 166 MHz
Pentiums, each with 32 MB of internal memory and a pair of 2 GB hard disks used exclusively for
data storage. A central 1 GB hard disk is available via NFS for program storage. The processors
each run the Linux operating system. Although we did not use it except for NFS access to the
central program storage, the processors are interconnected by a fast ethernet switch. We found that
our experimental timings were reproducible between processors, and so we were able to run multiple
timing tests independently, simultaneously, and reliably on this platform.
We chose k = 16 bytes as the record size for our tests. A record (sometimes we will call it an element) consisted of four integer (4 byte) fields: a key, an associated "data field", a timestamp, and an operation type (insert/delete/query) field. We chose kB = 4096 bytes, since this was the system page size. We chose kM = 500 KB (m = 125), which is small but allowed us to choose manageable problem sizes (from a disk space point of view) yet still apply stress to our algorithms.
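With 4-byte ints, the test record can be written down directly; the field order here is an assumption.

    // The 16-byte test record (k = 16). With kB = 4096 bytes, one page holds
    // B = kB / k = 256 records, and kM = 500 KB (512000 bytes) gives
    // m = kM / kB = 125 blocks, matching the parameters above.
    struct Record {
        int key;        // sort key
        int data;       // associated "data field"
        int timestamp;  // orders requests on the same key
        int op;         // operation type: insert / delete / query
    };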
3.5. Testing
In order to obtain meaningful performance results, we attempted to control the following factors:
Contention with other processes or users of the machine for machine resources such as memory, CPU and disks: We performed the testing on dedicated machines, and so the results were not affected by other users. The Unix operating system spontaneously initiates various system tasks to perform routine maintenance and monitoring functions, so it is not easy to avoid contending with these tasks for system resources. Smaller test runs may vary significantly due to these effects. However, on the larger test runs, the influence of system tasks on the run time can generally be ignored.
Virtual memory effects, such as the 'transparent' behaviour of the operating system in swapping parts of the program image between main memory and secondary storage: The swapping of portions of a task to disk to make room for another activity can occur without warning or notification. In our tests we attempted to minimize the likelihood of this occurring by choosing kM, the internal memory size in bytes, to be much smaller than the physical memory size. For instance, on our Linux machines with 32 MB of physical memory, we performed the majority of our tests using kM = 500 KB. Another tactic is to use the Linux mlockall service call to lock the application into memory. This seems to work reasonably well in some cases, and does allow the application to use up to 50% of the physical memory without fear of being affected by virtual memory effects. However, the requesting program must be running with "root" privileges for the request to be honoured.
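A minimal sketch of the mlockall tactic (Linux; the call typically succeeds only with root privileges):

    // Sketch: pin the process image in physical memory for the timing run so
    // the virtual memory system cannot swap it out.
    #include <cstdio>
    #include <sys/mman.h>

    int main() {
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            std::perror("mlockall");   // typically fails without root privileges
            return 1;
        }
        // ... run the timing experiment with memory locked ...
        munlockall();
        return 0;
    }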
3.6. Test Results
3.6.1. Comparison to Quicksort
We found that the buffer tree easily outperformed the "built-in" internal memory quicksort technique. A simple quicksort program was written using the built-in C function "qsort". Figure 2 shows the results for a range of problem sizes. For larger input sets, the (recursive) quicksort program ran out of stack space on our system, but by that time the internal sort was slowing due to virtual memory effects and the buffer tree was already outperforming it.
Figure 2: Timings for Internal Quicksort and Buffer Tree Sort on Random Inputs: for problem sizes larger than 2.8 million items the internal quicksort failed due to lack of internal memory.

3.6.2. Tuning the Buffer Tree

We discovered that the value of b relative to m is important to the performance of buffer tree sort on random data. For buffers of size m/2 and (a, b) = (m/4, m) as suggested in [3], we obtain (partial) block sizes of approximately B/2 keys pushed to the next level for each of (perhaps) m children of a node whenever the parent's buffer is emptied. Reducing the fanout, while maintaining the buffer size, increases the expected number of elements in each block. We found that (a, b) = (m/32, m/8) gave the best performance in our tests. Smaller or larger values of b resulted in longer run times. Figure 3 shows running time curves for Buffer Tree Sort (BTS) and for TPIE Merge Sort (TMS). Results for BTS are shown for several values of b, where a = b/4 in all cases. Both TMS and BTS are running with synchronous I/O and single buffering. The TPIE MMB stream option is used. The Buffer Tree is using the 'iostream' access method. BTS performance is best for about b = m/8 and gets worse if b differs much from this value. BTS with b = m/8 has running times that change nearly linearly with the problem size for the problem sizes shown.

Figure 3: Timings for Buffer Tree Sort and TPIE Merge Sort (curves: IOS b = m, b = m/7, b = m/8, b = m/9, and TPIE Merge Sort): TMS ran out of disk space at about 35 million elements because it keeps its original data file. BTS does not require that all of the input data be available before it begins, and does not require a separate file for the original data.
We caution that our experiments focused on finding support for the predictions of asymptotic behaviour of Buffer Tree Sort. The actual running times of BTS may be improved by further tuning, and the performance of TPIE Merge Sort may improve with other choices of parameters and options.
We found the increased speed with smaller b intriguing, and so we counted the number of block pushes performed by the algorithm. A block push occurs when data is pushed to a child buffer after the parent's buffer becomes full. It may consist of a partial block, a full block, or more than a block of request elements. Reducing the fanout increases the expected size of the data in a block push, and therefore may reduce the number of pushes required, and the number of I/O operations as a result. Figure 4 shows the relationship between several fanout values b and the number of block pushes over a range of input sizes. The reduction in block pushes seems to be the major reason for the improvement in running time between b = m and b = m/8.

Figure 4: Number of Block Pushes for Buffer Tree Sort on Random Inputs (curves: b = m, m/3, m/5, m/6, m/7, m/8, m/9).
3.6.3. Non-linearities in the Running Time
We observed that, contrary to predictions of the I/O model, for random input data our buffer tree sort implementation tended to have non-linear run times as the problem size increased. (Actually, we expect the running time to increase more than linearly, by a logarithmic factor. However, since the base of this logarithm is large, the predicted increase in running time is close to linear for the range of problem sizes considered.)
Figure 5 shows a graph of problem size versus runtime for random input data and b = m. Also shown in this graph are curves for the various activities of the buffer tree, i.e., a breakdown of where this time is spent. The total running time appears to be increasing superlinearly with the problem size. Total Running Time is the sum of running time for Insertions plus Force Empty All Buffers. Running time for Insertions is composed of the sum of Insertion: Empty Internal Buffers plus Insertion: Empty Leaf Buffers. Both of these seem to be more than linear with the problem size. However, referring to Figure 4, the number of block pushes is not increasing superlinearly.
Adjusting the parameter b in the buffer tree both reduced the number of I/O operations performed by BTS and apparently removed the non-linear behaviour in our tests. Figure 6 shows the same graph as Figure 5 for b = m/8. In contrast to Figure 5, the Total Running Time curve is quite linear after about 10 million elements. The component curves in the figure are equally well behaved.
While the constant represented by the slope of the running time curve is larger for BTS than for TMS, we note that BTS is an online sorting technique and therefore addresses a different situation than does TMS. (See Figure 7.)
3.6.4. Experiments with Parallel Disks
We experimented briefly with storing the buffer tree on multiple disks, by striping the buffers and leaves across two disks. We obtained a multiple disk driver (the "PDM API") from Tom Cormen at Dartmouth College, and ported it from the DEC Alpha environment to Linux without much difficulty. To manage concurrent disk access, the PDM API requires a Posix threads implementation, which we obtained from Florida State University. Perhaps due to the large data cache maintained by Linux, we found that large data volumes were required before two parallel disks outperformed a single disk for a simple "write-as-quick-as-you-can" application. For buffer tree sort this would require a single block push to be very large. We tried increasing m to allow this and did see marginally better performance with two disks for moderate values of n. Unfortunately, as n grew
towards a more interesting size we began to see our performance degrade, apparently due to virtual memory effects. We concluded that we needed more real memory for this sort of experiment.

Figure 5: Timings for Buffer Tree Sort on Random Inputs ("Breakdown of Buffer Tree Activities, b = m"); (a, b) = (m/4, m), m = 125, B = 256. Curves: Total Running Time; Insertion; Force Empty Buffers; Insertion: Empty Internal Buffers; Insertion: Empty Leaf Buffers.

Figure 6: Timings for Buffer Tree Sort on Random Inputs ("Breakdown of Buffer Tree Activities, b = m/8"); (a, b) = (m/32, m/8), m = 125, B = 256. Same curves as Figure 5.

Figure 7: Pipelining sort with generation of inputs: if the generation of the inputs is sufficiently time consuming, Buffer Tree Sort can provide a speed advantage over offline methods by permitting the insertion time to be hidden by the time to generate its inputs. The time to Force Empty Buffers then may be the only time that remains "visible". (Curves: Total Running Time; Insertion; Force Empty Buffers; TPIE Merge Sort.)

Figure 8: Running Times of PDM and Iostream I/O Access Methods (curves: IOS b = m, PDM b = m, IOS b = m/8, PDM b = m/8): the PDM uses asynchronous I/O, and this seems to give it an advantage for the larger fanouts.
Figure 8 shows running times for a single disk under the PDM API and C++ iostream access methods. The PDM API could be expected to be slightly slower, as it introduces some extra computation such as its use of threads. This seems to be true in the case of b = m/8, but its ability to overlap computation with I/O (asynchronous I/O) seems to allow it to outperform in the case b = m.
4. Conclusions
In this paper we describe an implementation of a buffer tree and two EM algorithms based on the buffer tree: an external memory treesort, and an external memory priority queue.
Our tests on random input sets lead to an experimental determination of parameter values different from those originally suggested in the design of the data structure.
Although the running times of our treesort implementation (BTS) with parameter b = m clearly show non-linearities, b = m/8 produced a running time curve which is for practical purposes a straight line when the problem size is more than 10 million elements. The application was also heavily I/O bound. This supports the prediction of the algorithm [3] and the model [28] that the asymptotic running time is Θ(n log_m n) I/Os and that the number of I/O operations is the dominant issue in the algorithm.
The non-linear behaviour of BTS with parameter b = m was manifested to various degrees in some of the other fanouts which we tried. While we expected some effect on running time as this parameter was varied, the sensitivity to non-linearity is troubling, and we do not rule out implementation decisions as a possible cause. We hope that by further unit testing and performance measurements of the various components we will soon be able to explain this behaviour.
We conclude that (a) the buffer tree as a generic data structure appears to perform well in theory and practice, and (b) measuring I/O efficiency experimentally is an important topic that merits further attention.
4.1. Acknowledgements
We would like to thank Lars Arge and Jeff Vitter for their encouragement and interest in this work, Tom Cormen for providing the PDM API, and Doron Nussbaum and Darren Vengroff for helpful discussions.
References
[1] A. Aggarwal and J. S. Vitter. The Input/Output complexity of sorting and related problems. CACM, 31(9):1116–1127, 1988.
[2] L. Arge. The buffer tree: A new technique for optimal I/O-algorithms. In Proc. Workshop on Algorithms and Data Structures, LNCS 955, pages 334–345, 1995.
[3] L. Arge. Efficient External-Memory Data Structures and Applications. PhD thesis, University of Aarhus, 1996.
[4] L. Arge, D. E. Vengroff, and J. S. Vitter. External-memory algorithms for processing line segments in geographic information systems. In ESA, LNCS 979, pages 295–310, 1995.
[5] R. D. Barve, E. F. Grove, and J. S. Vitter. Simple randomized mergesort on parallel disks. In Proc. ACM SPAA, 1996.
[6] Y.-J. Chiang. Experiments on the practical I/O efficiency of geometric algorithms: Distribution sweep vs. plane sweep. In Proc. Workshop on Algorithms and Data Structures, LNCS 955, pages 346–357, 1995.
[7] Y.-J. Chiang et al. External-memory graph algorithms. In Proc. ACM-SIAM Symp. on Discrete Algorithms, pages 139–149, 1995.
[8] Y.-J. Chiang. Dynamic and I/O-Efficient Algorithms for Computational Geometry and Graph Problems: Theoretical and Experimental Results. PhD thesis, Brown University, August 1995.
[9] T. H. Cormen. Virtual Memory for Data Parallel Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 1992.
[10] T. H. Cormen and M. Hirschl. Early experiences in evaluating the Parallel Disk Model with the ViC* implementation. Technical Report PCS-TR96-293, Dartmouth College, Computer Science, Hanover, NH, September 1996.
[11] T. H. Cormen and D. Kotz. Integrating theory and practice in parallel file systems. In Proceedings of the 1993 DAGS/PC Symposium, pages 64–74, Hanover, NH, June 1993. Dartmouth Institute for Advanced Graduate Studies.
[12] F. Dehne, W. Dittrich, and D. Hutchinson. Efficient external memory algorithms by simulating coarse-grained parallel algorithms. In Proc. ACM SPAA, pages 106–115, 1997.
[13] R. W. Floyd. Permuting information in idealized two-level storage. In Complexity of Computer Calculations, pages 105–109, 1972. R. Miller and J. Thatcher, Eds. Plenum, New York.
[14] G. A. Gibson, J. S. Vitter, and J. Wilkes. Report of the working group on storage I/O issues in large-scale computing. ACM Computing Surveys, 28(4), December 1996.
[15] M. T. Goodrich, J.-J. Tsay, D. E. Vengroff, and J. S. Vitter. External-memory computational geometry. In FOCS, pages 714–723, 1993.
[16] J. W. Hong and H. T. Kung. I/O complexity: The red-blue pebble game. In STOC, pages 326–333, 1981.
[17] S. Huddleston and K. Mehlhorn. A new data structure for representing sorted lists. Acta Informatica, 17:157–184, 1982.
[18] K. Mehlhorn and S. Näher. LEDA: A platform for combinatorial and geometric computing. CACM, 38:96–102, 1995.
[19] M. H. Nodine, M. T. Goodrich, and J. S. Vitter. Blocking for external graph searching. Algorithmica, 16(2):181–214, 1996.
[20] M. H. Nodine and J. S. Vitter. Large-scale sorting in parallel memories. In ACM SPAA, pages 29–39, 1991.
[21] M. H. Nodine and J. S. Vitter. Deterministic distribution sort in shared and distributed memory multiprocessors. In ACM SPAA, pages 120–129, 1993.
[22] J. F. Sibeyn and M. Kaufmann. BSP-like external-memory computation. In Proc. 3rd Italian Conference on Algorithms and Complexity, 1997.
[23] M. Smid. Dynamic Data Structures on Multiple Storage Media. PhD thesis, University of Amsterdam, 1989.
[24] D. E. Vengroff. A transparent parallel I/O environment. In Proc. 1994 DAGS Symposium on Parallel Computation, 1994.
[25] D. E. Vengroff. TPIE User Manual and Reference. Duke University, 1995.
[26] D. E. Vengroff and J. S. Vitter. Supporting I/O-efficient scientific computation in TPIE. In Proc. IEEE Symp. on Parallel and Distributed Computing, 1995.
[27] J. S. Vitter and M. H. Nodine. Large-scale sorting in uniform memory hierarchies. Journal of Parallel and Distributed Computing, 17:107–114, 1993.
[28] J. S. Vitter and E. A. M. Shriver. Algorithms for parallel memory, I: Two-level memories. Algorithmica, 12(2–3):110–147, 1994.