Exploiting Subspace Distance Equalities in Highdimensional Data For KNN Queries

Download as pdf or txt
Download as pdf or txt
You are on page 1of 13

Karlsruhe Reports in Informatics 2018,6

Edited by Karlsruhe Institute of Technology,


Faculty of Informatics
ISSN 2190-4782

Exploiting subspace distance equalities in


Highdimensional data for knn queries

Martin Schäler, David Broneske, Veit Köppen,


and Gunter Saake

2018

KIT – University of the State of Baden-Wuerttemberg and National


Research Center of the Helmholtz Association
Please note:
This Report has been published on the Internet under the following
Creative Commons License:
http://creativecommons.org/licenses/by-nc-nd/4.0/de.
Exploiting sub-space distance equalities in
high-dimensional data for knn queries

Martin Schäler1 , David Broneske2 , Veit Köppen2 , and Gunter Saake2


1
Karlsruhe Institute of Technology, 2 OVGU Magdeburg
1
[email protected], 2 [email protected]

ABSTRACT an approach that is expected to result in good response times for


Efficient k-nearest neighbor computation for high-dimensional data knn computation regarding various distributions instead of reaching
is an important, yet challenging task. The response times of state- peak performance for some data sets (cf. Figure 1).
of-the-art indexing approaches highly depend on factors like distri-
bution of the data. For clustered data, such approaches are several Idealized knn
factors faster than a sequential scan. However, if various dimensions response time

contain uniform or Gaussian data they tend to be clearly outper- Pivot-based


formed by a simple sequential scan. Hence, we require for an approaches

slower
approach generally delivering good response times, independent of Sequential
the data distribution. As solution, we propose to exploit a novel con- scan
cept to efficiently compute nearest neighbors. We name it sub-space
distance equality, which aims at reducing the number of distance
computations independent of the data distribution. We integrate knn
Elf approach
computing algorithms into the Elf index structure allowing to study faster
Application field
the sub-space distance equality concept in isolation and in combi- for sub-space
nation with a main-memory optimized storage layout. In a large distance equality
Stochastic
comparative study with twelve data sets, our results indicate that Dense distribution
Mixed Totally
indexes based on sub-space distance equalities compute the least clusters only Distribution uniform
amount of distances. For clustered data, our Elf knn algorithm de- Figure 1: Intuition of the application field of indexes based on
livers at least a performance increase of factor two up to an increase sub.space distance equalities
of two magnitudes without losing the performance gain compared
to sequential scans for uniform or Gaussian data.
The objective of this paper is to investigate how to design an
index that generally results in good performance. That is, for highly
1. INTRODUCTION clustered data we want to achieve, on average, at least comparable
In data analysis, efficient computation of the k-nearest neighbors performance to state-of-the-art indexes. In case the data set contains
(knn) for high-dimensional data sets has long been an important, yet uniform or Gaussian data, the average performance shall be at least
challenging problem [6, 11]. With emerging applications, like sci- comparable to a sequential scan. We explicitly do not target at
entific databases or time series analytics, this importance is further outperforming any known approach for any possibly existing data
increasing. The reason is that solving the knn problem is part of var- set. We deem such endeavor unreasonable considering decades
ious data analysis methods, including classification and clustering. of knn index research and the variety of parameters to consider.
To allow for efficient knn computation, a wide range of indexing Examples for such parameters are different dimensionality, number
techniques is known. Early approaches, such as R-Trees [15] or of points, and stochastic distribution of the data sets as well as
kd-Trees [3] are known to deliver good performance only in low- different knn distance functions, to name only a few.
dimensional spaces [20]. State-of-the-art approaches [8, 16], such as A novel concept that gives way to approaches generally resulting
iDistance [17], map each point in the data set to its nearest pivot(s). in good response times is sub-space distance equality. It is based on
Comparative studies [8, 17] report large performance increases. the observation that all points sharing the same prefix (combination
However, usually those approaches are difficult to tune as they fea- of values) have the same distance to any query w.r.t. the sub space
ture various parameters and are known to be parameter sensitive [23, defined by the prefix. Hence, if this concept is well exploited, the
27]. More importantly, distances between high-dimensional points number of required distance computations is highly reduced. That is,
tend to be very similar – particularly if the number of dimensions distance computation is executed only once per sub space (not once
increases. Therefore, the mapping of points to artificial clusters does per point) and the sub-space distance represents a tight lower bound
not solve the fundamental problem of high-dimensional data sets: for any point within that sub space. To our knowledge, the only
often a large fraction of the data set is visited upon knn computation. index that allows exploiting sub-space distance equalities is Elf [7].
With the increasing amount of main-memory, a major cost driver Moreover, this index features a main-memory optimized storage
for sequential scans, fetching all points from hard disk, has vanished. layout, which allows to investigate the effect of sub-space distance
This makes sequential scans as competitor even more powerful. equality on knn computation in isolation as well as in combination
Nevertheless, advances in hardware alone do not allow for efficient with an optimized storage layout. Therefore, we select Elf for our
analyzes of the fast growing amount of data. Hence, we require for investigations.
However, so far, Elf is only known to deliver high speedups for as the Euclidean (p=2) and Manhattan metric (p=1), are frequently
multi-column selection predicates (i.e., multi-dimensional range used. The general definition is given in Equation 1. Metrics have
queries) [7]. This query type is fundamentally different to knn special algebraic requirements (e.g., the triangle inequality must
queries. Hence, we make non-trivial extensions in order to open hold) used by state-of-the-art indexes, such as iDistance [17]. There-
this new application field for indexes based on the concept of sub- fore, we restrict the considered distance functions to metrics, even
space distance equality. Altogether, our investigations result in the as this is no special requirement for exploiting our concept.
following contributions:
2.2 Related work
1. We develop algorithms exploiting sub-space distance equali-
ties for knn computation and integrate them into the Elf index. There is a large variety of different approaches. Therefore, we
Thereby, we exploit the index’ genuine features at concept level do not focus on specific approaches, but on underlying concepts
as well as the level of optimized storage layout. and name well-known examples for each concept. For specific
approaches, we refer to large-scale comparisons cited in each group.
2. We comprehensively examine factors affecting the performance
of our knn algorithms allowing to build optimized Elf indexes Classic indexing techniques. Approaches like R-Tree [15]
for arbitrary combinations of data set and distance function. or kd-Tree [3] have given way to various improvements [4, 9, 19].
3. We comprehensively investigate the performance of our algo- There are large studies or surveys addressing this topic comprehen-
rithms. To this end, we consider twelve data sets with different sively, like [6, 11, 20]. All of these approaches use geometric forms
properties, such as data size and stochastic distribution, reveal- defined on the full-space, which is the primary reason why they
ing that our algorithms result in competitive performance for all tend to degenerate to a sequential scan for high-dimensional data.
data sets considering highly potent competitors. The reason is Nevertheless, to show differences, we include one approach from
that exploiting sub-space distance equalities results in the least this group in our comparative studies. The number of dimensions
amount of distance computations for all data sets. they can efficiently support varies, but usually is deemed between
10 and 20 [6, 26].
4. We reveal that the observed speedups hold for a well-known
As solution, optimized sequential searches are proposed, such as
instance of the Minkowski metric family.
the VA-File [26]. Their core idea is using a compressed representa-
Finally, to support repeatability, all data sets and implementations tion of the data that fits into main-memory allowing to filter points
are available open source1 . efficiently. The data itself resides on hard disk being multiple orders
of magnitudes slower than main memory. This difference is known
2. PRELIMINARIES AND RELATED WORK as access gap. However, with increased main-memory capacities,
usually the whole data set fits into main memory. Conceptually,
In this section, we introduce the knn problem including properties
one could transfer this idea, such that data resides in main memory
of distance functions. Furthermore, we introduce related work.
and the VA-file in a higher cache level (i.e., L3 cache). However,
2.1 The knn problem the speed difference between L3 cache and main memory usually
does not exceed one order of magnitude, i.e., the access gap is much
Given a set of points in a d-dimensional space (Nd ), a knn query smaller and its exploitation therefore less promising. To this end,
returns the k closest points to the query point according to some dis- we do not consider such concepts.
tance function. Formally, a query knn(q,dist(),D,k) has the
following input: a query point q ∈ Nd , a distance function dist(),
a set of points D all being ∈ Nd , and some k ∈ N with k > 0. It
Metric indexing with pivots. The AESA approach introduced
the idea of pivot-based indexing [18]. Generally, the idea is to deter-
returns a set S containing k points from D, where max_dist is
mine several pivot points. Then, one maps any point in the data set
the largest distance of any point p ∈ S. The following holds for any
to its nearest pivots. Due to these mappings the triangle in-equality
point p’ in D − S: max_dist ≤ dist(p’,q).
between query point, pivot, and data point can be used for com-
We map all dimensions in the d-dimensional space (Nd ) to integer
puting a lower bound of the distance between query point and data
numbers in interval of [0,c], where c is a user-defined granularity,
point using any metric [8, 25]. Thus, lower bound computation is
instead of the unit space Rd [0,1]. The reason is that many dist()
a simple addition and subtraction of pre-computed values. Various
functions, such as the Manhattan metric, can be computed using
approaches rely on this principle. The most comprehensive com-
integer arithmetic known to be generally faster than float arithmetic.
parison is [16]. A particularly relevant approach is iDistance [17],
In addition, this definition of the knn problem implies that one does
assigning one-dimensional indexes to the pivot-point mappings. To
not necessarily has to use dist(). There may be some function
this end, it allows to efficiently traverse promising candidates and
dist’() that is order preserving according to dist(), but easier
prune not relevant ones. Based on the one-dimensional index, it can
to compute. A common example is using the squared Euclidean
be used for data on hard disk or in main memory [23, 27]. We are
distance to avoid computing square roots.
particularly interested in revealing differences between approaches
relying on pivots and approaches relying on sub-space distance
!1
d
X p equalities. To this end, besides response time, we measure the num-
p
dist(X,Q) = |xi − qi | with p ≥ 1 (1) ber of points for that the index computes the exact distance, which is
i=1 also independent of the hardware and programming language used.

Metrics as distance functions. Different distance functions Approximate approaches. There are various approximate ap-
support different intuitions of distance. Metrics are a group of proaches that deliver an approximation of the correct k-nearest
distance functions that are well studied and used frequently. Partic- neighbor satisfying certain bounds, such as [1]. A large group of
ularly, the Minkowski metrics also referred to as Lp metrics, such approaches is locality sensitive hashing (LSH) [10, 13]. We do not
consider this group, since our objective is to find the correct nearest
1
http://www.elf.ovgu.de/knn neighbors.
3. KNN COMPUTATION WITH SUB-SPACE points cannot be part of the query answer. To optimally exploit
DISTANCE EQUALITIES this effect, one needs to provoke large sub-space distances, also by
selecting a respective dimension order.
In this section, we first introduce the concept of sub-space dis-
tance equalities and explain two effects allowing to exploit it for
efficient knn computation. Moreover, we briefly describe Elf as
Unknown characteristics of both effects. Considering both
an index featuring sub-space distance equalities. Finally, we intro- effects, we state it is not intuitively clear how to find a dimension
duce two knn computing algorithms aiming at exploiting sub-space order that optimizes either effect. To this end, we investigate how
distance equalities in Elf. With the help of these algorithms, we to select the dimension order, which we name the dimension order
selection problem in Section 4.1. To examine this problem, we
examine how to optimally exploit sub-space distance equalities for
knn computation. design knn computing algorithms exploiting both effects.

3.1 Sub-space distance equalities Existing iterative metrics. We find various metrics whose
To explain the concept of sub-space distance equality and its computation is iterative, such as the Manhattan, Chebychev, or
exploitation for knn computation, assume some n-dimensional data Hamming distance. However, generally the class of Lp metrics,
set, with n > 2 and sort it to some dimension order. That is, we such as the Euclidean metric, is not iterative. Nevertheless, due
do not only sort according to one dimension, but according to all. to the definition of the knn problem (cf. Section 2.1), we do not
For simplicity of explanation, let us assume that we sort according require for the actual distance values, but only the correct k-nearest
to dim1 , . . . , dimn . As a result, we observe that all points having neighbors (in the right order). Hence, we can use a replacement
the same value in the first dimension (dim1 ) are found next to each distance function dist’() that is order preserving w.r.t. dist().
other. However, that does not only hold for points having the same For instance, we can use the squared Euclidean distance, which is
value in the first dimension, but for all points sharing the same prefix. iterative and requires less computational effort. As this is possible
A prefix preu refers to the first u dimensions, i.e., it projects the n- for every Lp metric, this allows us to benefit from the concept
dimensional point to a u-dimensional one. It is important to note that of sub-space distance equality for various, well-known metrics.
all points sharing the same prefix preu are represented by the same Note, knn computation is possible for non-iterative distances as
point in the corresponding u-dimensional sub space. Consequently, well. In such cases, we need to compute the distance considering
they also have the same distance to any possible query point in that the whole sub space instead of simply adding dist(p[n],q[n])
sub space. For illustration, consider the (ordered) example data set per dimension.
Figure 2b. We observe that points T1 and T2 have the same prefix
until dimension dim1 . Consequently, in the one-dimensional sub 3.1.2 Distinction of sub-space distance equalities to
space consisting only of the first dimension, the distance to any other approaches
query point q is dist(q[1], T1 [1]) = dist (q[1] = T2 [1]), where Relying on sub-space distance equalities for knn computation is
q[1] refers to the value of q in the first dimension. This is what we fundamentally different from existing approaches. For illustration of
call a sub-space distance equality. the difference, let us consider R-trees and their derivatives, like R+ -
tree [24], R∗ -tree [2], M-tree [9], X-tree [4], using n-dimensional
3.1.1 Two effects for knn computation geometrical objects to index sub spaces. Knn computation with these
Sub-space distance equality features two effects that can be ex- indexes has two problems: 1) expensive distance re-computation
ploited for efficient knn computation. The first effect is re-use of and 2) considerably large tree depths. First, descending the tree,
sub-space distances. One can compute sub-space distances for each the geometrical objects used for indexing the space shrink in every
set of points having the same prefix only once, instead of computing dimension. This means that also their whole n-dimensional distance
it for each point. The second effect is, one can use the sub-space to the query point has to be recomputed in each descending step.
distance in a u-dimensional sub space as lower bound for the full- Second, the depth of the tree is not limited to a specific depth but is
space distance. We now outline both effects and how iteratively restricted to the number of points and fan out. Both points lead to a
computable metrics allow exploiting these effects. Finally, we give hardly definable computational complexity converging to visiting
examples which metrics are iteratively computable. all tree nodes as dimensionality increases [14, 5].
A metric is iteratively computable iff, given two arbitrary n- In contrast, this is different when exploiting sub-space distance
dimensional points p and q, the distance dist(p,q) can be com- equalities, as we iterate along dimensions refining the currently
puted as dist(p[1],q[1]) ⊕ dist(p[2],q[2]) ⊕ . . . ⊕ found distance. Hence, the previously computed sub-space distance
dist(p[n],q[n]). does not change and can be reused for the currently investigated
Re-use of sub-space distances. In case ⊕ is associative, we can dimension.
re-use the sub-space distance as follows: we compute the l − 1
dimensional sub-pace distance for all points P with the same prefix 3.2 Elf as MCSP index
prel−1 . Then, we split P , such that we group all points having To our knowledge, Elf is the only index that features sub-space
the same prefix prel . Next, to compute the sub-space distance for distance equalities and allows to study the effect of sub-space dis-
each newly created group, one only needs to add their distance in l, tance equalities in isolation and in combination with a main-memory
which is dist(p[l],q[l]), as all points in such a group have optimized storage layout. Therefore, it is the starting point for our
the same value in dimension l. To optimally exploit this effect, own investigations. Originally, Elf is proposed to efficiently evaluate
one needs to maximize the number of sub-space distance equalities, multi-column selection predicates named MCSPs. MCSPs are pred-
which can be done by selecting a respective dimension order. icates that include several (but not necessarily all) attributes of one
Lower bound. One can exploit this effect, if dist(p[n],q[n]) table [7]. Thus, it is designed for a different purpose, and we have
is positive, i.e., ≥ 0. As a result, any sub-space distance is a lower to extend that approach with knn-computing capability exploiting
bound for the full-space distance. Consequently, in case the sub- sub-space distance equalities as defined in Section 3.1. To this end,
space distance of a set of points exceeds or is equal to max_dist we introduce its conceptual design and memory layout. Based on
(the distance of the currently found kth nearest neighbor), those that, we develop knn computing algorithms in Section 3.3.
D1 D2 D3 D4 TID present the linearized Elf from Figure 2b. The first linearized
0 1 0 1 T1 DimensionList is a hash map storing pointers to the underlying
0 2 0 0 T2 DimensionLists. The second DimensionList stores values
1 0 1 0 T3 and pointers of the DimensionList (2), where values in brackets
represent pointers within the array. To minimize storage consump-
(a) Running example data tion, one does not store length information of a DimensionList.
0 1
Instead, the most significant bit (MSB) of the last value denotes
(1)
Dimension D1 the end of a DimensionList. We visualize this with a negative
value. For example, the −2 at offset 4 denotes the last entry of
(2) 1 2 (3) 0 1 0 T3 DimensionList (2). Similarly, a negative pointer indicates that
Dimension D2
the following DimensionList is a MonoList.
Dimension D3
0 1 T1 0 0 T2 +
(4) (5) Dimension D4
For our own investigations, the optimized storage layout offers
(b) Conceptual Elf two interesting examinations possibilities. First, we can investigate
0 1 2 3 4 5 6 7 8 9 whether different linearizations yield different performance, and sec-
(1) (2) (4) (5) ond quantify the effect of the best linearization strategy comparing
ELF[00] [02] -[12] 1 -[6] -2 -[9] 0 1 T1 0
(3)
it to its unoptimized version.
ELF[10] 0 T2 0 1 0 T3
3.3 Knn algorithms for Elf
(c) Memory layout of Elf We now introduce two knn algorithms exploiting sub-space dis-
Figure 2: Example data, conceptual Elf, and Elf memory layout tance equalities within Elf. The algorithms are optimized for differ-
ent properties of the data. The first algorithm aims at minimizing
the number of distance computations. To this end, it aims at fast
3.2.1 Conceptual design converging towards the final distance of the kth nearest neighbor
We illustrate the design of Elf with the data set in Figure 2a. Each allowing to prune entire sub trees within Elf. By contrast, the sec-
tuple in the data set has 4 dimensions and a TID to uniquely identify ond algorithm aims at data sets that deteriorate towards scanning
a tuple. Conceptually, Elf incrementally indexes existing values in an entire Elf optimally exploiting the data layout. The results in
sub spaces. That is, the first level in Elf (i.e., the root level) refers Section 4.1.4 allow selecting the best algorithm for a given data set.
only to the first dimension. The second level refers to the first and
the second dimension. Hence, it indexes a two-dimensional sub
1 KnnConverge ( q , k , dim , d i m L i s t , s u b S p a c e D i s t )
space of the whole data set. The third level therefore refers to a 2 f o r ( e a c h elem i n d i m L i s t ) / / f i n d b e s t match
three-dimensional sub space etc. The order of dimensions thereby 3 i f ( d i s t ( elem . v , q [ dim])−> m i n i m a l ) b r e a k
is the only parameter. Each node of Elf, called DimensionList, 4 n e w D i s t <− s u b S p a c e D i s t + d i s t ( q [ dim ] , elem . v )
5 i f ( newDist >=m a x d i s t ) r e t u r n / / p r u n e a l l
contains entries of the form [Value, Pointer] ordered according to 6 i f ( i s M o n o L i s t ( elem . p ) )
Value. The first level of the tree consists of one root node that 7 knnElfMono ( q , k , dim +1 , elem . p , n e w D i s t )
contains every unique value of the first dimension – for instance, the 8 else
9 KnnConverge ( q , k , dim +1 , elem . p , n e w D i s t )
values 0 and 1 of D1 in the Elf in Figure 2b. Each entry is the start 10 e l e m R i g h t<−elem ; e l e m L e f t<−elem
of one path. 11 w h i l e ( ! done ){ / / s e a r c h i n− and o u t w a r d s
All tuples having the same value in the first dimension are in 12 e l e m R i g h t<−e l e m R i g h t . n e x t
the same path. More generally, all tuples having the same prefix 13 newDist<−s u b S p a c e D i s t + d i s t ( q [ dim ] , e l e m R i g h t . v )
14 i f ( newDist >=m a x d i s t | | e l e m R i g h t . l a s t )
are found in the same path. For example, all tuples represented by 15 d o n e R i g h t<−t r u e
DimensionList (5) have the prefix 0 in the first dimension and 1 16 else
in the second dimension. Constructing the tree in this manner spans 17 / / r e c u r s i v e c a l l l i k e i n L i n e 6−9
18 / / s i m i l a r for elemLeft
a tree with d levels where d corresponds to the number of indexed 19 i f ( d o n e R i g h t&d o n e L e f t ) done <− t r u e
dimensions. However, due to the curse of dimensionality [5], the 20 }
more dimensions are indexed, the more sparsely populated (i.e., 21
22 KnnOptLayout ( q , k , dim , d i m L i s t , s u b S p a c e D i s t )
shorter) are the DimensionLists in deeper levels. The worst 23 elem <− d i m L i s t . f i r s t
case is that we have a linked-list like tree, which Elf counters with 24 do{
the concept of MonoLists [7]. 25 newDist<−s u b S p a c e D i s t + d i s t ( q [ dim ] , elem . v )
26 i f ( newDist >=m a x d i s t ) c o n t i n u e / / p r u n e
Whenever one encounters a tree level t in some path such that 27 else
there only remains one tuple in it (i.e., there is no fanout), the 28 i f ( i s M o n o L i s t ( elem . p ) )
build algorithm creates a MonoList. That is, from level t to d 29 knnElfMono ( q , k , dim +1 , elem . p , n e w D i s t )
30 else
(where t < d) Elf stores all values adjacent to each other. In 31 KnnOptLayout ( q , k , dim +1 , elem . p , n e w D i s t )
Figure 2b, DimensionLists (3), (4), and (5) are MonoLists, 32 } w h i l e ( ! elem . l a s t ) / / r e a c h e d end}
because there is no fanout since each of these DimensionLists
correspond to only one tuple. The benefits of MonoLists are that Figure 3: Elf knn algorithms with different optimizations
they reduce storage costs and improve data locality [7].

3.2.2 Optimized memory layout 3.3.1 Knn algorithm optimizing pruning power.
It is possible to build Elf with a main-memory optimized stor- The first knn algorithm KnnConverge in Figure 3 aims at
age layout as shown in Figure 2c. To this end, one linearizes traversing Elf such that max_dist converges fast against the dis-
Elf into an array, using a preorder traversal. In Figure 2c, we tance of the kth nearest neighbors. This optimizes Elf’s ability
to prune sub trees that do not contain query solutions. To this 4.1 Dimension-order selection problem
end, an Elf is traversed in a greedy manner starting by invoking The dimension order is the only parameter of the Elf build algo-
KnnConverge for the first dimension list. In each list, the algo- rithm. From [7, 22], we know how to construct an optimized di-
rithm searches for the best sub tree (best match), which has mini- mension order for MCSP-evaluation and predict Elf’s performance.
mum distance to the query point q. The algorithm first examines Unfortunately, the situation for knn queries is different. For knn
the corresponding sub tree of that element. Then, it examines the queries, we have to descent all paths as long as the remaining sub
remaining dimension elements in the current dimension list itera- tree can contain at least one query solution. Moreover, it is unknown
tively. That is, in the first iteration one element to the left and one whether one needs to optimize Elf for the re-using sub-space equali-
to the right are examined. The sub trees are only examined in case ties effect (i.e., maximizing the number of prefix redundancies) or
their sub-space distance is smaller than the (full-space) distance of the lower bound effect that allows pruning. The latter one means
the current kth nearest neighbor being max_dist. The algorithm maximizing the sub-space contrast generally minimizing the number
stops, in case there are no more solutions to the left and right or in of prefix redundancies. Thus, both targets are in contrast to each
case the best match already does not contain a query answer (Line other. Our goal is to find an approach that allows to determine a good
5). Note, knnElfMono computes the distance for the remaining dimension order in a reasonable amount of time. What makes this
dimension of one point. investigation particularly challenging is that the number of possible
dimension orders grows exponentially.
3.3.2 Knn algorithm optimizing data locality. Our investigation procedure is as follows. First, we examine one
The second knn algorithm is optimized for data sets, where prun- data set in detail and then generalize results. To this end, we take the
ing, for instance due to the curse of dimensionality, is difficult. Dr16 real-world data set (cf. Table 1) and create a new data set for
To this end, this algorithm does not search for the best match in that we can sample all possible (n!) dimension orders and construct
each dimension list. It strictly follows the data layout of Elf and the respective Elf. We call this data set micro benchmark data set.
thus, optimizes data locality. Therefore, it starts at the first element In addition, we restrict the empirical investigation to the Euclidean
of each dimension list, examines its corresponding sub tree, and metric, which is often used for such investigations.
then, continues with the next element. As a result, this algorithm
is more similar to a sequential scan which benefits primarily from 4.1.1 Influence of the dimension order
data locality. However, we still benefit from sub-space distance The first set of experiments aims at studying the problem com-
equalities reducing the number of distance computations and we prehensively. To this end, we answer the question: Does the order
also check whether we can prune. Note, traversing Elf results in matter, quantify the difference of a good, normal, and bad dimension
some overhead that a sequential scan does not have. Particularly, order, and the probability of finding either class by chance.
for high-dimensional uniform and Gaussian data, where sequential
scans outperform any known indexing approach. So far, we hypoth- 80 µ = 516, σ = 100, N = 363
esize that due to this algorithm Elf is more robust towards the curse
Dimension Orders

of dimensionality than other indexing approaches and examine this 60


Amount of

in Section 5.
40
3.4 Research questions
20
Since exploiting sub-space distance equalities is a novel concept,
we aim at revealing in-depth insights by means of systematic experi-
0
ments in the following section. In particular, we firstly investigate 300 400 500 600 700 800 900
how both effects from Section 3.1 using different dimension orders
Response Time Range in ms
affect knn computation and secondly quantify the influence of the
optimized storage layout. Considering Elf as foundation for our a) Response time histogram of all sampled dimension orders
investigation results in the following research questions (RQ):
800
Response Time in ms

1. RQ1: How to optimally exploit the two effects, sub-space dis-


tance equalities induce, using a respective dimension order within 600
Elf?
400
2. RQ2: How to quantify the effect of Elf’s optimized data layout
and its interaction with an appropriate knn algorithm? 200

The results of these micro benchmarks are used in Section 5 for 0


0 50 100 150 200 250 300 350
our comparative studies considering different data sets, distributions,
and Euclidean metric. Sample Permutation ID
b) Pattern of the response time distribution
4. PERFORMANCE FACTOR TUNING Figure 4: Results of the sampling method.
In this section, we examine the factors that are relevant for Elf’s
knn performance. In Section 3.4, the two research questions directly
refer to performance factors: RQ1 directly translates to the question Sampling method. For our micro benchmark data set having
of how to order the dimensions in Elf. RQ2 aims at finding an nine dimensions, testing all 9! dimension order permutations and
optimal storage layout and quantifying the effects of the storage determining the response times for retrieving all k nearest neigh-
layout in general. In the following, we conduct a set of micro bors for all 16k points in a statistical sound manner is not possible
benchmarks, to answer these RQs. The results allow to build an (approx. 126 computing days). Hence, we sample the space of the
optimized Elf for any data set. existing permutations taking every 1,000th permutation resulting
100 %
Dim 0
500

Response Time in ms
First permutation faster 80 % Dim 1
than from sampling Dim 2
Genetic Sampling
60 % Dim 3
Fastest 312 ms 317 ms 400 Dim 4
Avg Top 10 330 ms 335 ms 40 %
Slowest 870 ms 809 ms Dim 5
Avg Slowest 10 850 ms 751 ms Dim 6
20 %
300
Dim 7
1 2 3 4 5 6 7 8 9 10111213 0% Dim 8

Po 7

s9
Po 1
Po 2
Po 3
s4
Po 5
Po 6

s8
s
s
s
s

s
s

Po
Po

Po
Round
a) Method’s example results b) Results from genetic optimization c) Commonalities of fast dimension orders
Figure 5: Results from genetic optimization and how they relate to the dimension order.

in 363 samples. We depict a histogram of the sampled response the 8th or 2nd dimension must be first and the other one second. These
times in Figure 4a) revealing how the response times are distributed. are the same regions in the permutation space identified as having
By contrast, Figure 4b) visualizes whether there are regularities minimum response time being encircled in Figure 4. Interestingly,
in the permutation space regarding the response times. Based on inverting a fast dimension order results in a slow dimension order.
the results shown in Figure 4a), we conclude that the dimension The slowest ones we found are in fact inverse to the fastest ones.
order indeed has a significant influence on the response time. The
fastest order requires 317 ms, the average is about 516 ms, and the 4.1.2 Investigating the dominant performance factor
slowest ones more than 800 ms for finding all nearest neighbors Based on the insights of the prior method, we now analyze what
for all points. For comparison, the response time of a sequential is the factor in Elf to optimize to achieve good response times using
scan is 1,380 ms. Hence, guessing a dimension order definitely the dimension order. Therefore, we formulate hypothesis H1 and
results in a better performance than a sequential scan. However, as H2 , each expecting one performance factor to be the decisive one:
the response times are distributed in a skewed Gaussian manner,
probabilities are high not to select a fast order. The pattern of the Hypothesis H1 A high number of sub-space distance equalities is
distribution w.r.t. the linearized permutation space in Figure 4b) (i.e., the primary factor for good performance.
the order of the permutation and their corresponding response time),
reveals a mixture of local and global regularities (i.e., similar orders Hypothesis H2 The pruning power of Elf is the primary factor for
have similar response times). Hence, the dimension-order selection a good performance.
problem is a well-suited problem for genetic optimization.
To confirm or reject the hypotheses, we first compute one perfor-
Genetic method. Our goals with this method are twofold. First, mance indicator for each of them and then examine how they are
we intend to find a method that is applicable on larger data sets (i.e., related to fast and slow dimension orders.
particularly higher number of dimensions) that cannot be sampled
efficiently. Second, we want to explore the edges of the response H1 : Elf compression factor quantifying sub-space equal-
time distribution in more detail. In particular, we investigate com- ities. The first indicator, the Elf Compression Factor (ECF), quanti-
monalities of the fastest and slowest permutation in order to predict fies how many nodes we do not store compared to simple sequential
a good dimension order analytically. storage of the data set. Therefore, it counts the number of prefix re-
The results from the genetic method depicted in Figure 5, are dundancies and normalizes the result. We explain this indicator with
consistent with those of the sampling method. Hence, we claim that the help of the example data set D having four points: p1 =(1,2,3),
we studied the effect of the dimension-order selection problem on p2 =(1,2,2), p3 =(1,1,1), p4 =(2,1,1). First, we count every observed
the response time of Elf for this data set in a comprehensive way. prefix redundancy: There are three points (p1 , p2 , and p3 ) having the
Now, we investigate how to predict a good dimension order. In value 1 in the first dimension. Hence, we can eliminate two prefix
particular, the best-found dimension order has a response time of redundancies. In the second dimension, only the prefix (1,2) exists
312 ms, which is only 5 ms faster than the fastest permutation from twice. Hence, we observe one additional prefix-redundancy. Next,
the sampling method. By contrast, the slowest permutation found we compute the maximum number of dimension elements
requires 870 ms compared to 809 ms. used to normalize the result, which is equivalent to the worst case.
In Figure 5b), we show a box plot visualizing the response time In Elf that would occur in case there are no prefix redundancies at
distribution of the resulting population. After 9 rounds of the genetic all. Hence, the number is the product of the dimensionality of the
algorithm, the population already contains a dimension order that data set and the number of points, in this case: 12. As a result, for
yields faster response times than any order observed with sampling. D we get a value of 12 3
. Thus, a value close to zero indicates high
In round 13, the genetic method delivers a set of 50 dimension compression, while values close to 1.0 mean almost no compression.
orders, where the slowest order has a response time of 326 ms. We
analyze these permutations for commonalities to find a practical Rejection of ECF as dominant factor. Our experimental
method to predict a good dimension order. As starting point, we results reveal that the ECF of all 9! dimension orders are approxi-
visualize for these orders the frequency of occurring dimensions for mately uniformly distributed between the values [0.173, 0.257] with
each position in Figure 5c). From the figure, we see that all good an average of 0.208. The ECF values for the fastest 50 dimension
dimension orders start either with [8,2,...] or [2,8,...]. That is either orders found in the genetic optimization are in the interval [0.215,
0.240] with an average of 0.227. This interval covers 110,000 di- 4.1.3 Determining good dimension orders
mension orders, being roughly one third of all orders. Therefore, We revealed that the pruning power, indicated by large SSCpu
there only is some tendency for fast dimension orders to have an for small values of u, is the decisive performance factor. From
ECF rate that is higher than the average. However, even for the investigating the impact of the dimension order problem on the
highest ECF rate, there are more than 16,000 dimension orders that response time, we know there is a set of fast permutations with
have a higher ECF rate. For the slow dimension orders, we observe similar response time. Hence, it is reasonable not to aim at finding
an inverted tendency. That is, they generally have lower ECF than the best permutation, but a reasonably good one. To determine good
the average. As a result, this is not the dominant factor and thus orders, we propose two algorithms.
cannot be used to predict a good dimension order.
Local estimates. The first algorithm, local(), aims at fast
H2 : Sub-space contrast quantifying the pruning power. identification of a good dimension order, but does not take corre-
The second indicator (SSCpu – Sub-Space Contrast) denotes the lated dimensions into account. It works either on the computed
average distance according to dist() between two points in D SSCp1 values or a respective estimate, which depends on the applied
for the first u-dimensional sub space of a given dimension order distance function dist(). For Lp metrics, we use the variances
p. We compute SSCpu as the sum of the distances of all point pairs of the dimensions as estimate. Having, the estimates (or correct
and normalize by the number of pairs. For larger data sets, we use SSC values) the dimensions are sorted putting the one with largest
a sample of the data. Based on the definition of SSCpu , its value estimate first. The resulting order directly forms the dimension order.
is the same for all dimension orders if u refers to the whole space We can only rely on the variance if the following holds: Given three
(full-space contrast). However, in case pruning power is the decisive values c, v1 , v2 : dist(c, v1 ) > dist(c, v2 ) ←→ |c−v1 | > |c−v2 |. Us-
property influencing the response time of Elf, we need to maximize ing an example that means, it is required that according to dist()
the sub-space contrast for small values of u to get large distances in the value v1 = 0 is always further away from c = 10 than the value
high levels of Elf by selecting a respective dimension order. v2 = 1. Using the Hamming distance, v1 and v2 would have the
same distance, as both values are not equal to the compared value c.
In Figure 6, the fast and the slow predicted dimension order (dashed
Full-space contrast lines) are computed using this algorithm with variance as estimator.
15,000
Predicting a fast order results in a permutation having a response
SSC value

time of 381 ms, meaning that there are 8.3% of all permutations
10,000
faster than the predicted one. This is arguably not an optimal result,
but the ratio of invested time and achieved response time compared
5,000 to an average response time of 516 ms is decent.

0 Greedy determination. The second algorithm greedy() of-


u=0 u=1 u=2 u=3 u=4 u=5 u=6 u=7 u=8
fers reasonable execution times and considers correlated dimensions
and sub spaces reducing the sub-space contrast. The main difference
Predicted fastest (382 ms) Predicted slowest (838 ms) is that we compute the SSC in an iterative way. First, the dimension
Genetic fastest (312 ms) Genetic slowest (870 ms) with the highest SSC value is determined and forms the already
known part of p. Then, by probing all concatenations of the current
Expected SSC
already known permutation and all remaining dimensions, the al-
Figure 6: Dimension order selection: Exemplary SSC indicators for gorithm selects the concatenation having the maximum SSC value.
different values of u. Interestingly, for the micro benchmark data set and the Euclidean
metric the result is the same as for the first algorithm, but the exe-
cution cost are by far higher. For this second algorithm, for each
Confirmation of SSCpu as dominant factor. We confirm the of the d iterations, we compute in every step u the SSC value of
Hypothesis H2 that the pruning power is the dominant factor in case d-u possible sub spaces. This requires, for each possible sub space
the fastest 50 dimension orders found in the genetic optimization to compute the pair-wise distance for all points in the data set, i.e.,
are among those that converge fast against the full-space contrast for this algorithm’s complexity is higher than O(|D|2 ). As extension,
small values of u. By contrast, for the 50 slowest dimension orders, one could also probe the i best concatenations from each iteration.
we require to observe the opposite behavior. However, the execution time increases exponentially, e.g., in case i
In Figure 6, we depict exemplary SSCpu values for four dimension = d one would test all permutations.
orders and a baseline (black line). The baseline assumes that the
full-space contrast is uniformly distributed among all sub spaces, Table 1: Benchmark data sets from [21]
i.e., the delta between all SSCpu and SSCpu+1 is always the same.
We observe for the two fast example permutations that the SSC for Dim Size real-world property uniform Gaussian
small u is significantly higher than the baseline. For the fastest per- 16 11k Dr16 dense Du16 16
DG
mutation, we found that this is the maximally achievable SSC value 43 412k Dr43 clustered Du43 43
DG
50 50 50
for u < 3. We make the opposite observation for slow dimension 50 130k Dr sparse Du DG
51 51 51
orders. That is, their SSC values are consistently smaller than those 51 3446k Dr clustered Du DG
of the baseline. For both groups of permutations found in the genetic
optimization, these observations hold for all members of the group.
In addition, further tests reveal that dimension orders having consis-
tently similar SSC values as the baseline, have a median response 4.1.4 Empiric generalization
times. Hence, we confirm SSCpu as the dominant performance factor So far, we examined the influence of the dimension order parame-
and use it to analytically predict a good dimension order. ter with one data set resulting in the insight that we can determine
good dimension orders using the SSC value. Now, we aim at gen- second layout minimizes the cost for jumping to the next level (i.e.,
eralizing this result. To this end, we use the same benchmark data next dimension). We depict the resulting Elf for the example data
sets (cf. Table 1) as used in [21]. It consists of 12 data sets. There from Figure 2a in Figure 7. The core difference is highlighted in
are four groups, where the dimensionality and size of all data sets yellow being the second dimension list split into parts. Note, the size
within one group is identical. For each group there are three data of Elf remains the same in both layouts. The third layout considers
sets with different stochastic distributions. One data set contains dimension lists as lists potentially scattered across the memory. That
real-world data, the others are uniform and multivariate Gaussian is, there is no optimized layout.
data. We now investigate whether we can rely on the SSC value in
general. 0 1 2 3 4 5 6 7 8 9
(1) (2) (4) (5)
We use the local() algorithm to find good dimension orders. ELF[00] [02] -[12] 1 -[7] 0 1 T1 -2 -[2] 0
Hence, a slow dimension order is the inverse permutation of the (3)
ELF[10] 0 T2 0 1 0 T3
fast one. In Table 2, we depict the results. For space reasons, we
only depict the results for all real-world data sets and the whole Figure 7: Alternative depth-first memory layout of Elf
16-dimensional group. This is valid as for any other group the
results for the uniform and Gaussian data are, as expected, nearly
the same. In particular, this table contains the quotient of the average 4.2.2 Results
response times of the slow and fast dimension orders as ∆Tresp . In We examine for each layout the average response time for the
addition, it contains the deviation from the SSC values of a good best 50 permutations from the genetic optimization of Section 4.1.1.
dimension P order to the expected SSC value named ∆SSC. It is Again for each permutation, we compute a robust mean value ensur-
exp
defined as ddim=1 SSCfast dim - SSCdim normalized by dividing it
ing that numbers are statistically sound.
exp
by SSCdim . Intuitively, this can be interpreted as the (normalized
discrete) integral of the predicted and the expected SSC graphs in Table 3: Average response time for different memory layouts
Figure 6. Generally, the assumption formulated in Hypothesis H1 is: Layout: MCSP Depth-first List
the larger the difference between predicted and actual SSC value, KnnOptLayout 1152 ms 1176 ms 2513 ms
the better Elf performs. On the other hand differences close to zero KnnConverge 383 ms 460 ms 722 ms
predict no observable difference in ∆Tresp , i.e., for this data set the
dimension order does not matter. The results indicate that optimizing the convergence of the knn
algorithm KnnConverge is more important than finding a good
Table 2: Comparison of delta between SSC and response time
dimension order. This is visible in Table 3. Even for the best dimen-
Data set Dr16 Du16 DG 16
Dr43 Dr50 Dr51 sion order from the genetic optimization, the response times using
∆SSC 0.99 0.17 0.18 21.45 11.65 1.24 KnnConverge are consistently slower than the slowest permuta-
∆Tresp 2.50 1.10 1.10 26.60 41.79 1.76 tion found in the prior section using KnnOptLayout. However,
for the List Elf, without optimized storage layout, the response times
Based on the results in Table 2, we confirm that one can generally are by factor 2 slower. Interestingly, there is no observable difference
rely on the SSC value. First, for small ∆SSC values, such as 0.17, for knn Algorithm KnnOptLayout between the two Elf variants
we observe nearly no difference of the response times of the fast with optimized memory layout. However, for KnnConverge we
and slow dimension orders, i.e., a ∆Tresp value close to 1. Second, observe a difference. It results from iterating twice over a dimension
for a larger ∆ SSC value, we generally observe larger ∆Tresp . In list. Hence, using the depth-first layout means that KnnConverge
summary, the answer regarding RQ1 (cf. Section 3.4) is that we needs to jump twice over the memory for every dimension list, while
need to optimize for large ∆SSC. Then, the corresponding good the sub tree is adjacent. For the default layout the only jump is per-
dimension order in combination with KnnConverge forms an formed examining the sub tree, while the dimension list values are
optimized Elf. In case, ∆SSC is small, we use KnnOptLayout stored adjacent to each other. To quantify the difference, recapitulate
and the dimension order is not relevant. that the response times for the permutations are normally distributed
(cf. Figure 4) with a mean of 516 ms using the first layout. With
4.2 The influence of the Elf data layout 460 ms for the top 50 permutations of the depth-first memory layout,
Besides the concept of sub-space distance equality, Elf features an this value is only slightly better the average of the first layout. In
optimized storage layout. In this section, we investigate whether the summary, we conclude that using the same layout as proposed in [7]
same layout as for MCSP computation shall be used and quantify is expected to result in the best performance being the answer re-
the effect on response time. To this end, we conduct a second garding RQ2 from Section 3.4. Hence, we apply it in the remainder
micro benchmark. We use an Elf without optimized storage layout of the paper.
and in addition test an alternative layout with both knn algorithms
from Section 3.3. Recapitulate, the main difference between both 5. COMPARATIVE STUDIES
algorithms is that KnnConverge is optimized to converge to the The answers regarding the research questions RQ1 and RQ2 in
final distance of the kth nearest neighbor faster by iterating first over Section 4 allow to build optimized Elf indexes. Based on these
each dimension list and then traverse the closest sub tree first. By insights, we compare the Elf approach to state-of-the-art and well-
contrast, the KnnOptLayout is optimized to exploit data locality known competitors using different data sets in a systematic way.
in Elf and does not iterate twice over all values. To this end, we first introduce the study design ensuring a fair
comparison. Then, we present evaluation results.
4.2.1 The memory layouts
We consider three layouts: MCSP, Depth-first, and List. The 5.1 Study design
first is an Elf with optimized layout as shown in Figure 2c. The Ensuring a fair comparison of different approaches in main-
optimization objective is to speedup examination of one dimension memory environments is a non-trivial task. To this end, we explain
list, by storing all its values and pointers adjacent to each other. The how we ensure a fair comparison first.
5.1.1 Index and measurement selection parameter k revealing that their influence is negligible. For brevity,
Besides the sequential scan, which is known to be a highly-potent we only show the results for k = 10. Furthermore, we tested
competitor due to the curse of dimensionality, we use iDistance [17] implementation variants for each index selecting the fastest one and
as state-of-the-art indexing approach for metric spaces. We use repeated the experiments on different hardware all resulting in the
the kd-Tree [3] as classical indexing approach, which is known same findings. Finally, we executed index-specific parameter tuning
to work well in main memory settings due to its small size and for each data set. For Elf that is finding a good dimension order. For
simple algorithm particularity compared to members of the R-Tree iDistance, tuning is difficult as there are three parameters (number
family [15]. Finally, we also include an Elf without optimized of partitions, partition center selection, ∆r) and their influence is
memory layout named List Elf to examine the effect of the optimized not easy to predict [23]. Our tuning results are consistent with [23,
memory layout. 17].
Speedup as comparison measure. We use two different measure- The evaluation is performed on an Intel Core i5 with 2.6 GHz
ments for each triple of index, metric, and data set. The first is the clock frequency having 20 GB RAM. We use Java 8 following the
average response time for executing 1,000 knn queries. To ensure guidelines from [12] for Java benchmarking.
statistical soundness, the 1,000 points are randomly selected. We
repeat each experiment five times. Each experiment returns the
5.2 Results
median of ten measurements, i.e., we compute ten executions of The Euclidean metric is one of the most often used Lp metrics
1,000 knn queries. The average response time is the average of all with p = 2. Hence, we use it in our evaluation.
five experiment medians. We normalize this average response time
by the average response time of the sequential scan leading to the 16 dim. densely pop. 43 dim. clustered pop.
speedup of each index over the sequential scan. As usual, a speedup 100

Speedup w.r.t. sequential scan


value smaller than 1.0 indicates that the corresponding approach 10
is slower than the sequential scan, while higher values indicate the 1
factor of performance improvement over the sequential scan.
Number of distance computations. For a hardware-independent 0.1
comparison, we measure the average number of points per query for 0.01
that an index computes the real distance to the query point, named
50 dim. sparsely pop. 51 dim. clustered pop.
M . As for the response time, we normalize the measurement M n
100
by the number of points visited by the sequential scan (i.e., the data
set size). Hence, a value close to 1.0 indicates that nearly all points 10
are examined. Due to sub-space distance equalities, for Elf, dis- 1
tances are not computed point-, but attribute-wise (cf. Section 3.1). 0.1 Real Artificial
Therefore, we count every distance computation per attribute, i.e., in-
vocation of dist(elem.v,q[dim]) as found in the algorithms 0.01
f f f f
in Figure 3. Then, we divide by the number of dimensions of the El ance Tree t El El ance Tree t El
i st k d- L is ist k d- Lis
data set. This is valid, since for a worst-case Elf without sub-space iD iD
distance equalities (i.e., the values in the first dimension are unique), Figure 8: Speedup for Euclidean metric
Elf performs a sequential scan resulting in the same M n value the
sequential scan has. Note, the M values of Elf and List Elf are
identical. Table 4: Distance computation number M n Euclidean metric
Mn Dr16 Dr43 Dr50 Dr51 DG16
DG43
DG50
DG51
5.1.2 Benchmark data sets Elf 0.002 0.001 0.001 0.004 0.265 0.367 0.491 0.371
We rely on the same data sets as used in [21], already used iDistance 0.013 0.010 0.008 0.061 0.877 1 0.999 1
in Section 4.1.4, to evaluate different dimensionality, stochastic kd-Tree 0.078 0.023 0.163 0.209 0.961 1 1 1
distributions, and amount of points in the data set. These data sets are
particularly well suited to evaluate indexes in a systematic way, as
there are four groups of data sets, where each group consists of three 5.2.1 Real-world data sets
data sets. The groups have different number of dimensions reaching In Figure 8, we depict the speedups for all data sets. Recapitulate
from 16 to 51 and different size reaching from 11,000 points to that a value of 1 on the y-axis means that the respective approach is
3,446,000 points. To give an example, in group D50 all data sets as fast as the sequential scan. A value of 10 in the logarithmic scale
have 50 dimensions and 130,000 points, whereas in group D51 all indicates that the approach is ten times faster than the sequential
data sets have 51 dimensions and 3,446,000 points. Within each scan. In the results, we observe for all real-world data sets (in blue)
group, there is one data set containing uniform data, one containing that Elf consistently outperforms the sequential scan by several
multivariate Gaussian data, and one real-world data set. To improve factors, which also holds for the iDistance. The highest speedup
readability, we use the following notation Dyx to refer to a specific we observe is using Elf for the 43-dimensional real-world data set
data set, where x ∈ {16, 43, 50, 51} indicates dimensionality of the having a speedup of factor 68.23. In three out of four real-world
data set and y ∈ {u, G, r} its stochastic distribution. For example, data sets Elf is the fastest approach. For the 51-dimensional data
Dr51 refers to the real-wold 51 dimensional data set, whereas Du50 set using Elf results in a speedup of about factor 10 compared to
denotes the 50 dimensional uniform data set. a sequential scan. However, the iDistance is two times faster for
that highly clustered data set. We argue that this is explained by
5.1.3 Intrinsic and extrinsic validity the way the iDistance works (cf. Section 2.2): In case its partition
All index implementations are tuned to the same extent. The centers (by proper optimization) correspond to existing clusters, we
same holds for auxiliary structures such as the result set. In addition, found an optimal data set for such indexes. For the List Elf and the
we evaluated different number of queries and different values for kd-Tree, we observe that even for real-world data the sequential scan
outperforms them for two and three data sets, respectively. We also computation (e.g., within a clustering approach) for data sets with
observe a clear superiority of Elf (with optimized memory layout) unknown distribution is a good choice.
to its un-optimized counter part.
Considering the results in Table 4, we conclude that the speedups 6. CONCLUSIONS AND FUTURE WORK
are explainable by the reduction of distance computations using
For many data analysis tasks, like clustering or outlier detection,
indexing approaches. Interestingly, the ratio of M n between the
computation of the k-nearest neighbors is required and computation-
approaches approximately is the same as their difference in speedup.
ally expensive. A novel concept that has the potential to significantly
For valid interpretation why M n values are comparable among all
reduce the number of distance computations, independent of the
approaches, it is required to discuss a detail for the iDistance that
stochastic distribution of the data, is exploiting sub-space distance
also applies to any other metric indexing approach. Elf, kd-Tree, and
equalities. The core result of this paper is that, in case one exploits
sequential scan compute distances using squared Euclidean distance
sub-space distance equalities properly, one can expect to compute at
(i.e., omit the square root) referred to as dist’(). This is not
maximum about 60% of the distance values. This is important as
possible for iDistance as the triangle in-equality used for pruning
even state-of-the-art approaches face severe issues to outperform a
points only works with a metric, which dist’() is not. However,
simple cache conscious sequential scan if dimensionality increases.
upon knn query execution (not upon build) we can in most cases
To investigate the potential of our novel concept, we rely on Elf,
rely on dist’(). Only in the case we found a new kth neighbor,
in index featuring sub-space distance equalities combined with a
i.e., when we need to update dist_max of the result S, we have to
main-memory optimized storage layout. Our results allow to build
additionally compute the square root. These updates are by orders of
optimized Elf indexes by maximizing the sub-space contrast with
magnitudes smaller than M , so for nearly all distance computation
Elf’s dimension order parameter. In a comparative study with 12
also iDistance can rely on faster dist’() instead of dist().
data sets having different properties, we reveal that using Elf results
in competitive performance for all data sets considering highly
5.2.2 Gaussian and uniform data sets potent competitors. Our results reveal that for real-world clustered
In Figure 8, we depict the results for the multivariate Gaussian data the minimum performance increases compared to a sequential
data sets in red. Note, the results for the uniform data sets are almost scan is factor 2, while the largest ones are more than two magnitudes.
the same. Hence, for brevity we only depict the Gaussian results, but The competitors do not achieve this. Furthermore, even for Gaussian
our observations and interpretations also hold for respective uniform and uniform data, no more than 61% of the distances are computed
data sets. and, in any case, the smallest number of distances of all approaches
Generally, as expected, most approaches have severe issues reach- are computed. The reason is that sub-space distance equalities
ing the performance of the sequential scan. The only exception is represent tight bounds (i.e., points instead of regions), which are
Elf. It is the only index that (slightly) outperforms the sequential incrementally refined per tree level. Moreover, as the distances are
scan for one of the data sets (the largest one having 51 dimensions computed attribute-wise instead of point-wise, we can furthermore
and 3 million points) and reaches comparable speedups for the re- reduce the number of computed distances. Finally, if a large number
maining ones. The largest difference (speedup factor of 0.61) is of distances have to be computed, Elf benefits from its optimized
observed for the 16-dimensional Gaussian data, which we deem storage layout, which naturally converges to a row-store-like (i.e.,
acceptable for the following reasons. First, all other indexes are cache conscious) structure. This suggests that one can expect good
by far slower. The next fasted index is the kd-Tree with a speedup performance using an index based on sub-space distance equalities,
factor of 0.13 being nearly an order of magnitude slower. Due to such as Elf, in general.
the small size of the data set, the sequential scan highly benefits of For future work, we are interested in studying the relationship
the large cache sizes of todays CPU. By contrast, tree-based indexes and existence of order-preserving iterative versions of various well-
suffer for the concept related issue that they do not read memory known distance function. First results indicate that the requirements
sequentially, but jump across it. This is nevertheless an interesting of Elf towards the applied distance function are less strict than those
result, as we expect the cache sizes to grow in future as well as main metric approaches have, which require a metric.
memory latency to decrease where such issues become relevant also
for larger data sets. 7. REFERENCES
Examining M n in Table 4 reveals that all competitors degenerate [1] S. Arya, D. Mount, N. Netanyahu, R. Silverman, and A. Wu.
to sequential scans for Dr43 , Dr50 , and Dr51 . The same holds for An optimal algorithm for approximate nearest neighbor
their uniform counterparts. This is different for Elf, which does searching fixed dimensions. J. ACM, 45(6):891–923, 1998.
not compute more than half of the distances for any data set. The [2] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The
maximum M n value with 0.491 is observed for Dr51 , which means R*-tree: An efficient and robust access method for points and
that on average 49% of all distances are computed. Hence, even for rectangles. In Proc. Int’l Conf. on on Management of Data
such data sets using Elf results in comparable performance. (SIGMOD), pages 322–331. ACM, 1990.
[3] J. Bentley. Multidimensional binary search trees used for
5.2.3 Interpretation associative searching. Commun. ACM, 18(9):509–517, 1975.
Overall, our results indicate that Elf delivers the best performance [4] S. Berchtold, D. Keim, and H.-P. Kriegel. The X-tree: An
for real-world data sets, even if not the best for all data sets, which index structure for high-dimensional data. In Proc. Int’l Conf.
we argue is hardly possible. However, there is a consistent improve- on Very Large Data Bases (VLDB), pages 28–39. Morgan
ment of several factors compared to a sequential scan. Moreover, Kaufmann, 1996.
for Gaussian and uniform data, we state Elf is the only approach [5] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When
that results in comparable performance to a sequential scan, i.e., is is “nearest neighbor” meaningful? In Proc. Int’l Conf. on
resistant to the curse of dimensionality. This becomes visible as for Database Theory (ICDT), pages 217–235. Springer, 1999.
no data set more than 50% of all distances are computed. Therefore, [6] C. Böhm, S. Berchtold, and D. Keim. Searching in
we conclude that selecting Elf as index to speedup Euclidean metric high-dimensional spaces: Index structures for improving the
performance of multimedia databases. ACM Comput. Surv., iDistance: An adaptive B+-tree based indexing method for
33(3):322–373, 2001. nearest neighbor search. ACM Trans. Database Syst.,
[7] D. Broneske, V. Köppen, G. Saake, and M. Schäler. 30(2):364–397, 2005.
Accelerating multi-column selection predicates in [18] M. Micó, J. Oncina, and E. Vidal. A new version of the
main-memory - the Elf approach. In Proc. IEEE Int’l Conf. on nearest-neighbour approximating and eliminating search
Data Engineering (ICDE), pages 647–658, 2017. algorithm (AESA) with linear preprocessing time and memory
[8] L. Chen, Y. Gao, B. Zheng, C. Jensen, H. Yang, and K. Yang. requirements. Pattern Recogn. Lett., 15(1):9–17, 1994.
Pivot-based metric indexing. Proc. VLDB Endow., [19] S. Omohundro. Five balltree construction algorithms.
10(10):1058–1069, 2017. Technical Report TR-89-063, International Computer Science
[9] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient Institute, 1989.
access method for similarity search in metric spaces. In Proc. [20] H. Samet. Foundations of multidimensional and metric data
Int’l Conf. on Very Large Data Bases (VLDB), pages 426–435. structures. Morgan Kaufmann, 2006.
Morgan Kaufmann, 1997. [21] M. Schäler, A. Grebhahn, R. Schröter, S. Schulze, V. Köppen,
[10] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. and G. Saake. QuEval: Beyond high-dimensional indexing à
Locality-sensitive hashing scheme based on p-stable la carte. Proc. VLDB Endow., 6(14):1654–1665, 2013.
distributions. In Proc. An’l Symp. on Computational [22] J. Schneider. Analytic performance model of a main-memory
Geometry (SCG), pages 253–262. ACM, 2004. index structure. CoRR, abs/1609.01319, 2016.
[11] V. Gaede and O. Günther. Multidimensional access methods. [23] M. Schuh, T. Wylie, J. Banda, and R. Angryk. A
ACM Comput. Surv., 30(2):170–231, 1998. comprehensive study of iDistance partitioning strategies for
[12] A. Georges, D. Buytaert, and L. Eeckhout. Statistically kNN queries and high-dimensional data indexing. In Proc.
rigorous Java performance evaluation. In Proc. Conf. on British Nat’l Conf. on Databases (BNCOD). Springer, 2013.
Object-oriented Programming Systems and Applications [24] T. Sellis, N. Roussopoulos, and C. Faloutsos. TheR+-Tree: A
(OOPSLA), pages 57–76. ACM, 2007. dynamic index for multi-dimensional objects. In Proc. Int’l
[13] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high Conf. on Very Large Data Bases (VLDB), pages 507–518.
dimensions via hashing. In Proc. Int’l Conf. on Very Large Morgan Kaufmann, 1987.
Data Bases (VLDB), pages 518–529. Morgan Kaufmann, [25] J. Uhlmann. Satisfying general proximity/similarity queries
1999. with metric trees. Inf. Proc. Letters, 40(4):175–179, 1991.
[14] S. Guhlemann, U. Petersohn, and K. Meyer-Wegener. [26] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis
Reducing the distance calculations when searching an M-Tree. and performance study for similarity-search methods in
Datenbank-Spektrum, 17(2):155–167, 2017. high-dimensional spaces. In Proc. Int’l Conf. on Very Large
[15] A. Guttman. R-trees: A dynamic index structure for spatial Data Bases (VLDB), pages 194–205. Morgan Kaufmann,
searching. SIGMOD Rec., 14(2):47–57, 1984. 1998.
[16] M. Hetland. The Basic Principles of Metric Indexing, pages [27] C. Yu. High-dimensional Indexing: Transformational
199–232. Springer, 2009. Approaches to High-dimensional Range and Similarity
[17] H. Jagadish, B. Ooi, K.-L. Tan, C. Yu, and R. Zhang. Searches. Springer, 2002.

You might also like