A-Tree: Distributed Indexing of Multidimensional Data For Cloud Computing Environments
Dimitrios Katsaros
Abstract—Efficient querying of huge volumes of multidimensional data stored in cloud computing systems has become a necessity, due to the widespread deployment of cloud storage facilities. With clouds getting larger and the available data growing, it is mandatory to develop fast, scalable and efficient indexing schemes. In this paper, we propose the A-tree, a distributed indexing scheme for multidimensional data capable of handling both point and range queries, appropriate for cloud computing environments. A performance evaluation of the A-tree against the state-of-the-art competitor attests its superiority, achieving significantly lower latencies.

Keywords—distributed index, multidimensional data, point query, range query, query processing, cloud computing.
I. INTRODUCTION

II. RELATED WORK

III. REQUEST-RESPONSE FRAMEWORK
When a query is forwarded to a slave node and, after executing it, the result set is empty, it is said that a false positive occurred. For a cloud index to be successful, false positives should be as few as possible, because fewer false positives also imply smaller latencies.
In this framework (Figure 3), the task of query processing on a master node can be divided into two phases: a) locate the relevant nodes, and b) forward the query to these nodes. Relevant nodes process the received query locally and return the results directly to the user, thus avoiding unnecessary resource consumption on master nodes.
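As a rough illustration of these two phases, the following sketch (in Python; the names route_query and intersects are ours, not the paper's) locates the slave nodes whose advertised regions intersect a query and forwards the query to them:

def intersects(region, query):
    # region and query are hyper bounding boxes: sequences of (lo, hi) pairs
    return all(q_lo <= hi and lo <= q_hi
               for (lo, hi), (q_lo, q_hi) in zip(region, query))

def route_query(updates, query, forward):
    # updates: the (slave_id, region) pairs this master node has received
    # Phase one: locate the relevant slave nodes.
    relevant = [sid for sid, region in updates if intersects(region, query)]
    # Phase two: forward the query; slaves answer the user directly.
    for sid in relevant:
        forward(sid, query)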
Every slave node is responsible for sharing any data updates with all master nodes that fully cover the data located at this particular slave node. An update is sent when a node starts up; thereafter, partial or full updates are exchanged according to the changes that occur. In the case of deletions and insertions, it is mandatory that master nodes be informed about them. Another important aspect, which is beyond the scope of this paper, is load balancing. Load balancing must be given attention too, and there are already proposals, such as PASSION [15], that can easily be adopted to raise the performance of the described framework.
Our indexing scheme, the A-tree, is composed of an R-tree and a Bloom filter on each slave node and an array of updates on each master node (which differs from the approach of EEMINC), as shown in Figure 3. Before describing how master nodes handle requests, we must first describe how slave nodes handle data and how updates are sent and maintained.
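A minimal sketch of this composition, assuming any off-the-shelf R-tree and Bloom filter implementation (here a plain set stands in for the Bloom filter, and the class names are ours):

from dataclasses import dataclass, field

@dataclass
class SlaveNode:
    rtree: object    # local R-tree over this slave's records
    bloom: set       # Bloom filter over the record keys (set as stand-in)

@dataclass
class MasterNode:
    # One entry per slave: the bounding boxes and Bloom filter it last sent.
    updates: list = field(default_factory=list)   # [(slave_id, boxes, bloom)]

    def relevant_slaves(self, key, point):
        # Point queries are pre-filtered with each slave's Bloom filter;
        # range queries must rely on the bounding boxes alone.
        return [sid for sid, boxes, bloom in self.updates
                if key in bloom and any(
                    all(lo <= x <= hi for (lo, hi), x in zip(b, point))
                    for b in boxes)]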
IV.
Algorithms 1 and 2 (pseudocode). Selecting the R-tree nodes to distribute and sending updates to the master nodes.
C. Partial Updates
The aforementioned two algorithms describe how to select nodes for distribution. Upon an insertion or deletion, a change in a local R-tree may occur. To deal with these changes we use partial updates.
Upon a deletion request, we delete the appropriate records from the tree and also keep a counter of deleted records. When the counter reaches a threshold, we update the Bloom filter and invoke Algorithm 2 to send a new update. It is recommended that the threshold be set dynamically according to the load on the system. To deal with insertions, each node keeps an extra hyper bounding box. If a newly inserted record is not covered by the previously sent update, the extra bounding box is expanded to cover it. As a partial update, this extra bounding box is sent together with the Bloom filter, if the latter changed. Because the volume of this extra bounding box may grow very large, we set a threshold on it; when the threshold is reached, Algorithm 2 is invoked to send a new update and reduce the number of false positives. A sketch of this bookkeeping is given below.
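In the following sketch, the two thresholds and the send_full/send_partial callbacks are hypothetical stand-ins for Algorithm 2 and the messaging layer; boxes are assumed to be sequences of (lo, hi) pairs:

def expand(box, other):
    # Smallest hyper bounding box covering both boxes (None means empty).
    if box is None:
        return other
    return tuple((min(lo1, lo2), max(hi1, hi2))
                 for (lo1, hi1), (lo2, hi2) in zip(box, other))

def volume(box):
    # Product of the side lengths of the hyper bounding box.
    v = 1.0
    for lo, hi in box:
        v *= hi - lo
    return v

def contains(outer, inner):
    return all(lo1 <= lo2 and hi2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(outer, inner))

class PartialUpdater:
    def __init__(self, delete_threshold, volume_threshold,
                 send_full, send_partial):
        self.deleted = 0          # counter of locally deleted records
        self.extra_box = None     # extra hyper bounding box for new records
        self.delete_threshold = delete_threshold
        self.volume_threshold = volume_threshold
        self.send_full = send_full        # rebuild filter, run Algorithm 2
        self.send_partial = send_partial  # ship box (and filter if changed)

    def on_delete(self):
        self.deleted += 1
        if self.deleted >= self.delete_threshold:   # ideally set from load
            self.send_full()
            self.deleted = 0

    def on_insert(self, record_box, sent_boxes):
        if any(contains(b, record_box) for b in sent_boxes):
            return                # already covered by the last update
        self.extra_box = expand(self.extra_box, record_box)
        if volume(self.extra_box) >= self.volume_threshold:
            self.send_full()      # fresh update cuts false positives again
            self.extra_box = None
        else:
            self.send_partial(self.extra_box)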
V. PERFORMANCE EVALUATION
Each slave node links to the master nodes' router with a speed of 100 Mbps, and each user's link to the router is 100 Mbps. The MTU was set to 1500 bytes for all links. The network topology is shown in Figure 4. A user sends requests to a specific master node, defined at initialization, and receives results from any slave node. After the result for the previous request returns, another request is sent, until the predefined number of requests is reached. The latency of each request is recorded and, when all users finish, the last one calculates the total average latency. All master nodes receive requests from an equal number of users. Each slave node also keeps two counters of false positives, one for point and one for range queries.
Experiments were run on a computer system equipped with four quad-core AMD Opteron 8350 processors, 16 GB of RAM and 320 GB of storage. The operating system was SUSE Linux Enterprise Server 10 (x86_64) running kernel version 2.6.16.21-0.8-smp. Each experiment was run five times and the average is used for plotting the graphs.
In the first experiment, the latency of point and range queries is measured as the number of nodes in the system increases. The total average number of false positives is also measured, as it is very important for minimizing network usage and leads to faster query processing when a node's data set contains no matching results. Fifteen users are connected to the system and each one executes one hundred point queries and one hundred range queries. There are five master nodes, each one responsible for three users. Each slave node is responsible for ten thousand records, meaning that the size of the collection increases as the number of nodes increases. Due to the use of Bloom filters, point queries are forwarded to slave nodes with a very small probability of a false positive, resulting in a dramatic decrease in resource consumption and a lower average latency for point queries. Bloom filters cannot be used for range queries; still, the A-tree performs faster, especially for large numbers of slave nodes, showing that the proposed algorithm for distributing R-tree nodes is efficient. Figures 5(a) and 5(b) show the average latency for point and range queries respectively, revealing that the A-tree can handle both point and range queries efficiently, with very low latency especially for point queries.
In order to evaluate how the number of master nodes affects query latency, the same experiment was repeated five more times, with the difference that ten master nodes now exist in the system. The results are shown in Figures 5(c) and 5(d). First, we notice that latency is lower for the A-tree, in contrast to EEMINC, where latency is higher when more master nodes exist in the system. Furthermore, we notice that for a small number of nodes with ten master nodes, the latency of range queries is almost the same as EEMINC's, but as the system gets larger the A-tree performs better.
Concerning point queries, EEMINC performs worse when more nodes exist in the system, in contrast to the A-tree, where a little performance is even gained. Table I shows the total average number of false positives for range and point queries. The false positive counts for range queries are very close, but the difference for point queries is very large because of the use of Bloom filters. Although the difference in false positive rates for range queries is not high, the latency grows for more than one hundred nodes in the system. Even a few fewer false positives can have a huge impact on large systems, where possible disk accesses are avoided.
Figure 5. Average point and range query latency vs. number of nodes: (a) average point query latency with five master nodes; (b) average range query latency with five master nodes; (c) average point query latency with ten master nodes; (d) average range query latency with ten master nodes.
In the second experiment, in order to evaluate the effect of the data size on point and range queries, the number of slave nodes was set to fifty and the data size was varied from fifteen thousand to seventy-five thousand records per slave node, resulting in a total of up to 3,750,000 records.
We notice a small increase in the A-tree's latency as the collection gets bigger, due to a higher number of false positives, especially for point queries. This is because the Bloom filter, which was set to a default size of 30.5 Kb, fills up, and the false positive probability rises. The average latency for range and point queries is shown in Figure 6. The A-tree is still capable of answering both point and range queries efficiently and scales as the system and the data set grow larger. Although the false positive rates for both point and range queries are lower for the A-tree than for EEMINC, this small increase for point queries, together with the fact that the data set is larger on every node, explains the small increase in latency. This small increase in latency can easily be avoided by using bigger Bloom filters and by setting the maximum number of bounding boxes in updates higher.
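The standard Bloom filter estimate makes this trade-off explicit: for a filter of m bits with k hash functions and n inserted keys (the textbook formula, not one given in the paper), the false positive probability is approximately

p \approx \left(1 - e^{-kn/m}\right)^{k},

so for a fixed filter size m the probability grows with the number of stored records n, which is why larger data sets per node call for bigger Bloom filters.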
Comparing the results of the previous experiments, we notice better performance when fewer nodes are used, with more records on each of them. This behaviour is somewhat expected, because when more nodes with fewer records exist, the whole system is not fully utilized. As long as each node is not overloaded, adding more records to a node lowers latencies and increases the performance of the whole system.
Table I. Total average number of false positives for range and point queries as the number of slave nodes increases.

#Nodes    Range queries          Point queries
          A-tree      EEMINC     A-tree    EEMINC
100       49813       49707      -         49943
200       74809       74972      -         74894
300       99861       99958      10        99931
400       124702      124830     11        124484
500       149583      149716     13        149836
Table II. Total average number of false positives for range and point queries as the number of records per slave node increases.

#Records/Node   Range queries          Point queries
                A-tree      EEMINC     A-tree    EEMINC
15000           37325       37339      199       37434
30000           37189       37212      707       37500
45000           37049       37123      1509      37500
60000           36911       36918      2516      37481
75000           36791       36803      3577      37499
Figure 7. Average query latency as the number of users increases: (a) with five master nodes; (b) with ten master nodes.
Figure 8. Average query latency for different network speeds with ten master nodes: (a) average latency for point queries; (b) average latency for range queries.
Figure 9. Average query latency as the number of users increases, for different networks.

ACKNOWLEDGEMENT

REFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]