Indexing Heterogeneous Data
Nilesh Dalvi
Department of Computer Science
University of Washington
Seattle, WA
[email protected]
Dan Suciu
Department of Computer Science
University of Washington
Seattle, WA
[email protected]
September 27, 2003
Abstract
We consider the indexing problem for heterogeneous data, where objects are sets of attribute-value pairs,
and the queries specify values for an arbitrary subset of the attributes. This problem occurs in a variety of
applications, such as searching individual databases, searching entire collections of heterogeneous data sources,
locating sources in distributed systems, and indexing large XML documents. To date no efficient data structure
is known for such queries. In its most simplified form the problem we address becomes the partial match
problem, which has been studied extensively and is know to be computationally hard. We describe here the first
practical technique for building such an index. Our basic idea is to precumpute certain queries and store their
results. User queries are then answered by retrieving the “closest” stored query and removing from its answers
all false positives. The crux of the technique consists in chosing which queries to precompute. There are several
design choices, and we evaluate them both theoretically and experimentally (with real data sets). One major
advantage of this technique is that it can be easily implemented on a relational database.
1
Introduction
Web search engines have proved highly popular for searching unstructured documents (Web pages). While in
principle a similar technology can be deployed to search structured information sources, this has not happened
so far. We believe that there are two reasons for that. First, there was little incentive. Structured information sources are either beyond firewalls (e.g. corporate data bases), or unique, well known data sources that are
queried by users directly, without any support form a centralized index (e.g. the Protein Sequence Database,
http://pir.georgetown.edu/). The second reason is technical: it is much harder to search and index a
large collection of structured objects than a collection of documents.
We argue that the first reason is quickly disappearing, as more communities need to share their structured
sources on the Web. Consider the case of sharing scientific data. Until recently, individual researchers collected
and analyzed their own data and kept it proprietary, except for a few large, well-known and well-maintained public
databases. But today there is increased pressure on the scientists to share their data. The U.S. National Institute
of Health will expect all funded scientists to release their final research data for use by other scientists [23].
Some scientific journals are making data sharing a condition of publication [22], while others are coming under
pressure to do so [19]. Examples of efforts by scientific communities to share structured information sources are
the SkyQuery project in astronomy [26, 20], the Institute for Systems Biology [15], and the Human Brain Project.
In the business world, vendors of commercial database systems are also considering adding keyword searches to
their systems. While most commercial systems already provide full text capabilities, vendors are now moving
towards supporting searches in both the schema and the content of a database: DataSpot [9] is a middleware,
DBXplorer [3] is an extension of Microsoft’s SQL Server, while DISCOVER [14] describes a general framework
for keyword searches in a database and demonstrates it on Oracle. These systems offer keyword searches over a
single database. The next, and harder step, is to support keyword searches on collections of data sources: this is
needed for example in order to locate relevant sources in large distributed systems [11].
The technical barrier however remains: searching structured data sources is a harder problem than searching
unstructured documents. One challenge is that the translation of keyword searches into SQL becomes too difficult:
the techniques used by the systems above for a single relational database do not scale to a large collection of data
sources. Another challenge is how to reconcile and relate the schemas of the various data sources. There has been
extensive work in the past on addressing schema mappings [21, 13], and we will not be concerned with that in
this paper. Even assuming the schemas have been reconciled, it is unclear how one would search the collection
of heterogeneous data. Unlike information retrieval systems, here we need to support queries that specify schema
elements in addition to data values being searched. In other words the queries need to be schema aware.
Our approach to searching heterogeneous data sources consists of two steps. In the first, all data sources are
mapped into a common data model, in which objects are represented as sets of attribute-value pairs. While this
translation has the disadvantage that all data needs to be copied to a central server, and may not be fully up-to-date
with the original data sources, it has the advantage that sources can be easily added and schema mappings can
be easily encoded in the translation itself. More importantly, it addresses the query translation problem: keyword
searches become now partial match queries. A partial-match query consists of a set of attribute-value pairs, and
its answer consists of all objects that contain all these attribute-value pairs, and possibly more: hence the name
“partial-match”. The second part of our approach deals with techniques for efficiently answering partial match
queries. This will be the focus of our paper. The specific technical problem that we address is how to build
an efficient index for partial match queries on heterogeneous data. Partial match queries have been extensively
studied in the past, starting with Bentley [4], mostly in the context of relational data, and have been considered to be
difficult to answer efficiently. In the case of heterogeneous data the problem becomes even harder, because the total
number of attributes is much larger, and recently some lower bounds on its complexity have been proved [6, 16].
We describe in this paper the first practical index structure for partial match queries on heterogeneous data,
called PMI, for Partial Match Index. In our approach, the answers to a certain subset of queries are precomputed
and stored in the index. The selected queries need to be chosen carefully, in order to allow all queries to be
answered efficiently using the stored answers, and at the same time preventing the size of the index from growing
out of control. An important point for the class of applications that we envision, is that the partial match index
can be implemented easily on a relational database (we used a popular commercial database engine), and a partial
match query can be answered with just a few disk accesses. The paper makes the following contributions:
• It describes an approach to searching heterogeneous data through partial match queries.
• It proposes a simple index structure for partial match queries, that can index an arbitrary set of heterogeneous
data.
• It describes how this index can be implemented using a relational database engine.
• It analyzes the time complexity of the index, and discusses several variations to the basic scheme.
• It evaluates the index structure experimentally, on real data sets.
The paper is organized as follows. Section 2 defines the problem and shows the limitations of naive approaches.
The index is described in Section 3. It is then analyzed and discussed in Section 4. Section 5 presents the
experimental evaluation. We discuss related work in Section 6 and conclude in Section 7.
2
Searching Heterogeneous Data
We describe here our approach to searching heterogeneous data.
2.1
Problem Definition
Data Model The first step is to map the heterogeneous data into a flat semistructured data model where objects
are sets of attribute-value pairs. Thus, a data object d is:
d ::= [A1 = v1 , A2 = v2 , . . . , Ak = vk ]
(1)
where A1 , . . . , Ak are attributes and v1 , . . . , vk are atomic values. The set of attributes is not fixed a priori: an
attribute is just a string. Each object may have a different set of attributes, and multiple occurrences of the same
attribute are allowed.
Data sets can typically be mapped to this simple model quite easily, e.g. by outer-joining relations in a relational model, or flattening elements in an XML instance. For a single relational database the attribute-value pair
model corresponds to the universal relation, described for example in [27]. In order to be able to map multiple
data sources to this model it is important to keep the model flexible and to allow each object to have its own set
of attributes. When schema mappings between sources are available, then attributes have to be renamed, to ensure
that attributes from different sources that are mapped to each other have the same name in the attribute-value pairs
model. Notice that the model is flat, and, as a consequence, some information may be lost during the mapping,
e.g. grouping associations in nested XML documents.
Each object in (1) has a unique object-identifier, oid. We denote with U a set of such objects: U will be our
data instance.
A partial-match query has the form:
q = [B1 = w1 and . . . and Bp = wp ]
(2)
Here B1 , . . . , Bp are attributes and w1 , . . . , wp are constants. Each condition Bi = wi is called a predicate,
hence the query consists of a conjunction of p predicates. The query q matches a data object d ∈ U if all attributes
in q occur in d, and have the same values: formally, referring to Equations (2) and (1), if there exists integers
i1 , . . . , ip ∈ {1, 2, . . . , k} s.t. B1 = Ai1 , . . . , Bp = Aip and w1 = vi1 , . . . , wp = vip . The query specifies only
parts of the attributes and values in the object, hence the name partial-match query. For a query q and data instance
U , we denote with SU PU (q) the set of objects in U that match the query q, and refer to it as the support of q.
Applications The partial match problem has numerous applications. We illustrate here three. First, consider
the special case when there is a single attribute, call it A. Then both objects and queries are sets of strings. A query
q matches an object d iff all strings in q appear in d. This is precisely the setting in information retrieval engines,
where both documents and queries are sets of words. Hence any efficient implementation of partial match queries
also applies to IR engines.
The second application is indexing XML documents. Previous work [8] focused on the navigation part of
XPath queries, by indexing path queries with a single predicate. To evaluate XPath expressions with multiple
predicates, the results of each individual predicate must be joined. A technique that replaces joins with traversal of
a suffix tree is described in [28]. Partial match queries offer an alternative approach to XML indexing, in which the
predicates are evaluated first, then the navigation constraints are checked separately. To evaluate the predicates,
an XML document is first mapped to the attribute-value pairs model by associating an attribute Ai , i = 1, . . . , k
to each path occurring in the document. A partial match query specifies a conjunction of predicates on the XML
documents being searched.
The third application is peer data management systems [12, 13, 17, 1], which propose a flexible architecture
for sharing heterogeneous data. A major unsolved problem in such systems is how to discover peers that hold
relevant data. Most systems proposed so far use some form of flooding or gossiping, which has known limitations.
An exception is [11], which uses a distributed hash table (DHT) to locate data sources in a distributed system.
None of these systems scale well to queries with multiple keywords, which are precisely the partial match queries.
The Problem This paper addresses the following:
Problem 1. (Partial Match Indexing Problem) Given a data instance U , construct a data structure (the index) s.t.
for every partial match query q the answer SU PU (q) = {o1 , o2 , . . . , ok } can be computed efficiently.
Example 2.1 We illustrate the partial match problem with an example from indexing XML data. Consider the
Protein Sequence Database, available at http://pir.georgetown.edu/ which is an XML data instance
containing entries about proteins. We map this XML data to our data model by flattening it1 . Every sequence of
XML tags becomes a new attribute, for a total of 75 attributes, like:
header/created_date
reference/refinfo/authors/author
reference/refinfo/year
organism/variety
. . .
The data is truly semistructured: different protein entries have different sets of attributes, and some attributes
may have multiple occurrences in the same entry. Consider now a typical partial match query:
[refinfo/year = ’2000’ and
reference/refinfo/authors/author = ’White,_O.’
and reference/refinfo/citation = ’Nature’
and feature/feature-type = ’domain’]
Such queries are often formulated by scientists when searching scientific databases. They use multiple predicates to narrow their search, much in the same way we use multiple keywords in Web search engines.
The data set has about 260,000 objects (protein entries); the query matches only 43 objects. The challenge in
designing a partial match index is to be able to retrieve these 43 objects with a reasonably small number of disk
I/Os.
What is interesting about this query is to examine the number of objects that match each individual predicate.
This is shown in the table below (we show only the last element label, for succinctness):
Predicate
year=’2000’
author=’White, O’
citation=’Nature’
feature-type=’domain’
all four predicates
No. of Matches
37,335
37,822
63,220
35,004
43
Each individual predicate matches more objects (by three orders of magnitude) than the entire query. As we
show next, this gap makes the partial match indexing problem hard.
1
Some information about grouping is lost during flattening.
2.2
Naive Approaches and Their Limitations
We discuss here a few simple approaches to the partial match indexing problem and highlight their limitations.
We start by considering approaches based on a relational database engine, then briefly discuss other techniques.
Recall that U denotes the data instance.
The Join Approach Define the data table to be a relation RU(Oid, Attr, Val) which stores all attribute
value pairs for all objects in U . That is, an object d = [A1 = v1 , . . . , Ak = vk ], having identifier o, will be
decomposed into k tuples, which are stored in RU: (o, A1 , v1 ), . . . , (o, Ak , vk ). A partial match query of the form
(2) can be answered exactly by the following SQL query:
SELECT x1 .oid
FROM RU x1 , RU x2 , . . . , RU xp
WHERE x1 .Oid = x2 .Oid and
...
x1 .Oid = xp .Oid and
x1 .Attr = ’B1 ’ and x1 .val = ’w1 ’
and . . . and
xp .Attr = ’Bp ’ and xp .Val = ’wp ’
To make this technique practical we build two indexes for the table RU, a clustered index on (Attr, Val),
and an unclustered index on Oid. If one of the p predicates is very selective, the optimizer has a very efficient
plan: start by looking up the selective predicate in the (Attr, Val) index, then index-join the result with the
other p − 1 tables. But if none of the predicates is selective, then there are no efficient plans. An alternative to
the RU table is a universal table, where each data item is a tuple with lots of null values. This, however, is not a
feasible representation because of two reasons. First, according to the data model we have assumed, the complete
set of attributes is not know a priori. Secondly, even if this is known, a large set of attributes will make the size of
the universal table prohibitively large. In fact, some commercial database systems impose an upper bound on the
number of attribute allowed in a table.
Example 2.2 (cont’d) Continuing Example 2.1, the partial match query with four predicates gets translated into a
four way join. The query took 304453 logical2 disk I/O’s to run on a SQL Server database to find the 43 objects
in the answer.
The False Positives Approach An alternative approach is the following. Pick some i = 1, . . . , p, and execute
the following SQL query:
Qi : SELECT x.Oid
FROM RU x
WHERE x.Attr = ’Bi ’ and x.Val = ’wi ’
This query translates into a single lookup in a clustered index, which is very efficient. The answer to Qi
contains all answers to the partial match query q, plus false positives. We need to eliminate these in a separate
step, by iterating over all resulting Oid’s and checking the other p − 1 predicates3 . We can optimize this: using
data statistics, pick that i for which SU PU (Qi ) is smallest. Or, if data statistics are not available, then execute
all queries, Q1 , . . . , Qp (they can all be run in the same stored procedure), and return the result with the smallest
2
The number of logical disk I/O’s reported by the SQL Server is the total number of accesses to disk pages, regardless of whether the
pages are in the buffer cache or not. It is a more accurate measure of the query’s complexity than the running time.
3
This approach assumes that objects can be accessed efficiently by their oid.
Distribution Across Queries on Protein Database
100000
2 predicates
3 predicates
4 predicates
6 predicates
Ratio
10000
1000
100
10
1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Fraction of queries with higher ratio
Figure 1: The ratio |SU PU (pi )|/|SU PU (q)| for various queries q.
cardinality4 . Again, this approach works well when one of the predicates is very selective. Otherwise, the postprocessing step becomes quite inefficient.
Example 2.3 (cont’d) Continuing Example 2.1, the partial match query can be answered in two steps. First,
answer the query feature-type=’domain’, which returns 35,004 items. This query took only 259 logical
disk I/O’s to run on SQL Server. Next we need to scan over them to find the 43 items that match the entire query.
Unfortunately, there are no efficient solutions for this step, other than a sequential scan over the 35,004 objects.
Fetching these objects takes at least an additional 35,004 logical disk I/O’s.
Discussion An efficient solution to the partial match indexing problem should answer a query q in time proportional to |SU PU (q)|. Instead, the approaches described above take time at least proportional to |SU PU (pi )|, where
pi is the most selective predicate in q. The question is how likely are queries for which |SU PU (pi )| ≫ |SU PU (q)|,
like the query illustrated in Example 2.1, where the gap was three orders of magnitude. We argue that they are
quite likely. The analogy in Web search engines is when users type multiple keywords: this usually happens after they tried single keywords and received too many answers. Anecdotal evidence we collected from scientists
interested in sharing data through peer data management systems is that they would need queries consisting of
several predicates, say 4 to 6 or more, in order to narrow their search to only few data sources: no individual
predicate is selective enough to narrow the search. Finally, in a more systematic investigation, we analyzed the
Protein Sequence Database in Example 2.1 and inspected all possible queries with 2, 3, 4, and 6 predicates that
return some non-empty answers on the data set. For each of them we computed the fraction f between the number
of items matched by the most selective predicate and the number of items matching the entire query. For example,
for the query in Example 2.1 this number would be f = 35, 004/43 = 814, because the most selective predicate
matched 35,004 objects, while the entire query matched 43 objects. The graph in Figure 1 shows the cumulative
number of queries that have f above a certain value. Reading from left to right on the x axis, 20% of the queries
had f ≥ 625, 40% of the queries had f ≥ 45, etc. In other words, if one chooses a query at random, with a
probability of 0.2 the value f is ≥ 625. Thus, queries for which |SU PU (qi )| ≫ |SU PU (q)| exists, and form a
non-negligible fraction of the set of all queries.
4
One may be tempted to intersect the set of oid’s in the p answer sets, but this brings us back to the join approach.
Other Approaches to Partial Match Queries Bitmaps are an effective way to perform intersections. Several
database systems support bitmap indexes. The problem here is that, when used for partial match queries, the
bitmaps take in the worst case time
P that is proportional to the number of objects matched by each predicate. Thus,
their runtime is proportional to i |SU PU (pi )|, and not to |SU PU (q)|. Compression techniques can speed up
execution, and could prove effective in certain cases in practice. But they lack theoretical guarantees. In this paper
we focus solely on techniques that have strong time guarantees, and do not consider compressed bitmaps.
Early approaches to the partial match problem [4] were designed in the context of a single table, with a fixed
and not too large set of attributes. They are not appropriate for heterogeneous data.
3
The Partial Match Index (PMI)
We describe here our Partial Match Index, PMI. It is implemented on top of a relational database, and consists of
only a few tables, each with an associated clustered index. The PMI has two parameters, chosen by the systems
administrator to adjust the time/space trade off: S, a natural number, and ε > 0. The number of logical disk I/O’s
needed to answer a query q is guaranteed to be ≤ max(S, (1 + ε)|SU PU (q)|), plus a negligible overhead.
The Tables PMI consists of the following tables:
• The table RU(Oid, Attr, Val) was introduced in Sec. 2.2 and has a clustered index on (Attr,
Val).
• The table RU’, which is a copy of RU, but with a clustered index on Oid.
• A number of tables L1 , L2 , . . . , Lk , . . . The exact number of tables depends on the data instances, but is
typically small (say, under 10 or so). For each k ≥ 1, Lk has the schema Lk (Oid, Attr1 , Val1 ,
..., Attrk , Valk ). An entry (o, A1 , v1 , . . . , Ak , vk ) in Lk denotes the fact that the object o contains
all attribute values pairs (Ai , vi ), i = 1, . . . , k, and possibly more. In other words, o is in the answer to
the query q = [A1 = v1 , . . . , Ak = vk ]. The table Lk will contain all answers o to some queries q. We
will explain below which queries are selected to be included in Lk . To eliminate redundancies, all tuples
(oid, A1 , v1 , . . . , Ak , vk ) in Lk have the predicates ordered lexicographically, i.e. (A1 , v1 ) < (A2 , v2 ) <
. . . < (Ak , vk ). By definition (A, v) < (B, w) if A < B, or A = B and v < w.
We first describe how to search in the PMI, then define which queries are stored in the level k tables.
Answering Queries with the Stratified Index Consider a query q = [B1 = w1 , . . . , Bp = wp ]. We first
compute a set of oids denoted f ind(q), which contains all answers to q plus some false positives, and |f ind(q)| ≤
max(S, (1 + ε)|SU PU (q)|). Then we iterate over all oids in this set and eliminate the false positives.
The function f ind(q) is computed as follows. Assume that (B1 , w1 ) < (B2 , w2 ) < . . . < (Bp , wp ). For each
k = 1, . . . , p, let qk = [B1 = w1 , . . . , Bk = wk ] be the subquery consisting of the first k predicates in q. For each
k = p, p − 1, . . . , 1, lookup qk in Lk . If we find it, then f ind(q) is the corresponding set of oid’s from Lk (this
is precisely SU PU (qk ), hence it contains all answers to q plus some false positives). Otherwise we continue with
k − 1, etc. If we reach k = 1 and still couldn’t find q1 = [B1 = w1 ] in L1 , then we search q1 in RU and return
the set of oids from RU. The number of logical disk I/O’s to compute f ind(q) is negligible: it consists of at most
p unsuccessful index lookups, followed by one successful lookup, which returns |f ind(q)| oid’s
To eliminate the false positives, we iterate over all oid’s in f ind(q), and check the remaining predicates Bi =
wi , for i = k + 1, . . . , p, by looking up that object in RU’. This step takes |f ind(q)| ≤ max(S, (1 + ε)|SU PU (q)|)
logical disk I/O’s (assuming that each object fits on one page) because of the clustered index on Oid.
Defining Lk The difficult problem is to decide which queries to store in Lk : if we choose to include all
queries, then the size of the entire PMI is prohibitively large. We start by including in L1 all unary queries q
for which |SU PU (q)| > S. That is, L1 contains only those (attribute, value)-pairs with high support. Next we
build the tables L2 , L3 , . . . , in this order, as follows. To build Lk , we consider all queries with k predicates that
occur in the data instance U : q = [A1 = v1 , . . . , Ak = vk ], and decide whether to insert q in Lk or not. By
“inserting q in Lk ” we mean inserting all tuples (oid, A1 , v1 , . . . , Ak , vk ), for oid ∈ SU PU (q). To make that
decision, we check the following condition. Assume that (A1 , v1 ) < (A2 , v2 ) < . . . < (Ak , vk ), and consider
the query qk−1 = [A1 = v1 , . . . , Ak−1 = vk−1 ]. Compute the set Answ = f ind(qk−1 ): this is the approximate
answer that we would obtain if we try to answer q, and q were not in Lk . Notice that it is possible to compute
f ind(qk−1 ) at this stage because all tables below Lk are already computed. We insert q in Lk if S <| Answ |
and (1 + ε) | SU PU (q) |<| Answ |. To see why, consider what happens if we don’t insert it. In that case
f ind(q) = Answ. Clearly SU PU (q) ⊆ Answ, but the question is how many false positives do we have. We
know that either | Answ |≤ S, or | Answ |≤ (1 + ε) | SU PU (q) |, hence f ind(q) ≤ max(S, (1 + ε)|SU PU (q)|).
This completes the definition of the PMI. Assuming that each object can be stored on one page, the following
holds:
Theorem 3.1. The number of logical disk I/O’s needed to answer a partial match query q with the PMI is ≤
max(S, (1 + ε)|SU PU (q)|), plus a negligible amount.
Example 3.2 (cont’d) Continuing example 2.1, we answer the query q by issuing the following query5 :
SELECT x.Oid
FROM L4 x
WHERE x.Attr1 = ’author’ and x.Val1 = ’White, ’
and
x.Attr2 = ’citation’ and x.Val2 = ’nature’
and
x.Attr3 = ’feature-type’ and x.Val3 = ’domain’
and
x.Attr4 = ’year’ and x.Val4 = ’2000’
(Notice that we ordered the attributes lexicographically.) This query returned the 43 answers directly, and used
only 5 logical disk I/O’s on SQL Server. Recall that the join-approach took 304453, while the false-positives
approach took 259 + 35004 logical disk I/Os.
We defer a discussion on the size of the PMI to Sec. 4. Here, we discuss first how to build the PMI.
3.1
Construction of PMI
The definition of the PMI cannot be turned into a construction procedure, because it enumerates all queries in the
data, which is too inefficient. The technique we use to build the PMI is somewhat similar to the item set generation
algorithm in data mining. We assume that the data instance U is available in a file and can be read sequentially.
We build the indexes iteratively. For each k, the construction of Lk requires exactly two scans over U . At each
level, we create two intermediate tables:
• CN Tk is the count table and it stores all k-predicate queries which go into Lk along with their count.
• CN Dk is the candidate table and it stores all k-predicate queries q such that for any i ≤ k, the query formed
by taking the first i predicates of q does not belong to CN Ti with count less than S.
The queries in CN Dk are the only candidates for insertion into Lk . This is because if some Li stores a query
q with support less than S, none of its extensions have to be considered in higher level indexes, as all of them can
be answered using q.
5
We omit the complete paths and show only the last tag. In practice, we assigned a separate identifier to each of the 75 distinct paths in
our data set and used that instead.
q1
q2
q3
q4
q5
: (p1 , p2 , p3 , p4 )
: (p1 , p2 , p3 , p5 )
: (p1 , p5 , p6 , p7 )
: (p2 , p5 , p6 , p8 )
: (p2 , p5 , p8 , p9 )
p2
p1
p2
p5
p3
p6
p5
p6
p8
p4
p5
p7
p8
p9
Q1
Q2
Q3
Q4
Q5
Figure 2: Sample Queries and Query-Tree
Now we describe how to build the level k index, having constructed all lower level indexes.
CN Dk is initialized to the subset of CN Dk−1 × CN D1 for which the (k − 1)th predicate is smaller than the
k th predicate (because we only store queries in lexicographical order). By the definition of CN Dk , any query not
in CN Dk−1 cannot be the initial part of a query in CN Dk . This significantly reduces the number of queries we
need to consider for Lk .
Now we construct the count table. Note that since CN Dk only contains a set of queries and not their support,
its size is small compared to the data. We load the CN Dk table into the main memory. Then, we perform a single
scan over U and with each data item read, we update the counts of all the queries that contain it.
To speed up this computation, we create a tree representation of CN Dk in memory. Every node in the tree is
a predicate and every leaf corresponds to a query. Figure 2 shows few queries belonging to CN D4 , where each pi
denotes a predicate, and the corresponding tree.
For each data item read, the tree is searched from top to bottom. If the data item doesn’t satisfy some node,
the whole subtree below the node is pruned. The query-tree representation not only results in sharing of computations(testing predicates) between queries, it also compresses the size of CN Dk .
All candidate queries along with their counts are inserted in CN Tk . Next we prune the count table. We remove
a query q = (a1 , v1 , · · · ak , vk ) from CN Tk if there exists a query qi = (a1 , v1 , · · · ai , vi ) in CN Ti with count
less than (1 + ε) times the count of q. We are left with precisely those queries in CN Tk that need to be inserted in
Lk . We load back CN Tk into memory as we did for CN Dk , and perform another scan over U and insert into Lk
data items along with the matching queries. Finally, we prune the CN Dk table by removing those queries which
occur in CN Tk with count less than S.
3.2
The Cut-off Optimization
In some applications queries that return too many answers do not have to be answered exactly. If a query returns
too many answers, users are likely to refine their query (by adding more predicates) and are not interested in seeing
all the answers. We can exploit this in the PMI and save space by removing queries with large support.
Let Qu be a number representing the threshold after which exact answers are no longer interesting. There are
two ways to implement this optimization:
1. Hard Cut-off If a user issues a query q s.t. |SU PU (q)| > Qu , then an error message is returned.
2. Soft Cut-off If the user issues a query q s.t. |SU PU (q)| > Qu , then only Qu objects need to be returned,
together with a warning message.
To chose the PMI parameters, we proceed as follows. Choose S > Qu to represent an upper bound on the
cost that is acceptable to answer a query, and define ε = S/Qu − 1. In this way, if the user issues a query q
s.t. |SU PU (q) ≤ Qu , then the number of objects inspected by the PMI is ≤ max(S, (1 + ε)|SU PU (q)|) ≤
max(S, (1 + ε)Qu ) = S.
The PMI definition is then modified as follows. For the Hard Cut-off, do not store any queries in index with
support more than S, because by definition of PMI, these queries are used to answer only those queries whose
support is > S/(1 + ε) = Qu . If a user issues a query with support greater than Qu , it is simply not answered. In
the Soft Cut-off, for queries with support greater than S we store only S tuples chosen randomly. Again, this does
not affect queries of size less than Qu and they are answered exactly and efficiently. For queries larger than Qu ,
the index only returns Qu tuples (or more) instead of the complete answer.
3.3
Handling Updates
Let U be the data instance on which the index is constructed. Let ∆U be a set of data items to be inserted or
deleted from the data instance. We describe a soft update to the index that takes time proportional to the size of
∆U . The resulting index answers queries on the new data correctly and provides performance gurantees similar
to the old index provided ∆U is small compared to U . The soft update is carried out as follows.
• Soft Insert : Let ∆U be the set of data items to be inserted. For each level k, CN TK is loaded into memory
and ∆U is scanned. All data items that match any of the queries in CN Tk are inserted into Lk along with
the matching queries.
• Soft Delete : Let ∆U be the set of data items to be deleted. For each level k, delete from Lk all data items
contained in ∆U .
It is easy to see that in the new index, each query in the index stores correct answer. However, the index no
longer provides any performance gurantees as ∆U may change the ration of the sizes of a query and its subquery.
This will not be significant for small updates. The performance can progressively detoriate as more and more
updates occur. To restore the performance, the index can be reconstructed from the new data.
To summarize, changes in the data instance can be handled using quick soft updates, along with infrequent
rebuilds when the current data becomes significantly different from the original data.
4
Analysis and Variations
We now discuss and analyze the Partial Match Index (PMI) theoretically and place it in the context of related work
done on the set containment problem. Our discussion will also lead us to alternative indexes for the partial match
problem, which we will compare experimentally to PMI in Sec. 5.
4.1
Background: The Set Containment Problem
An extensively studied problem that is equivalent to the partial match problem is the set containment problem. Let
[m] = {1, 2, · · · m}. An object s is a subset of [m], s ⊆ [m]. As before we associate an oid to each object, and
denote with U a data instance, i.e. a set of objects. A query is also a subset of [m]. The answer q consists of all
objects in U that contain q. We also call this set the support of q, SU PU (q) = {s|s ∈ U, q ⊆ s}. Notice that
q ⊆ q ′ implies SU PU (q) ⊇ SU PU (q ′ ) and if q = ∅ then SU PU (q) = U .
We next define the set containment problem.
Problem 2. (Set Containment Indexing Problem) Given a data instance U construct a data structure (the index)
s.t. for every query q ⊆ [m], the answer SU PU (q) = {s1 , . . . , sk } can be computed efficiently. We will denote
n = |U | the number of objects in the data instance.
The partial match problem and the set containment problem can easily be reduced to each other as follows.
Consider a partial match problem, and number all possible attribute-value pairs occurring in the dataset from 1
100
1
60
4
40
25
3
4
23
15
2
23
4
45
82
3
2
2
3
55
85
3
1
4
25
25
4
3
4
35
23
15
1
23
30
4
20
Figure 3: Illustration for CIP
to m. Each object can be thought of as a set consisting of attribute-value pairs, and hence, a subset of [m]. A
partial-match query also specifies a set of attribute-value pairs and again can be represented by a subset of [m].
The answer to the query is precisely the set of data items containing the query as a subset. The converse reduction
is trivial.
The time/space trade off of the set containment problem has been studied theoretically. One extreme approach
is answer each query with a sequential scan over the data: here the time complexity is linear in the size of the data,
and so is the space complexity. The other extreme is to precompute and store the answers to all 2m queries: here
the time complexity is sub-linear in the size of the data, but the space complexity is exponential. Until recently it
was not clear that one can do any better than these two extremes.
Charikar, Indyk and Panigrahy [7] found the first index that has a better time/space compromise. We review it
here, introducing our own notations. We call this algorithm CIP.
CIP has a parameter c, that can be adjusted to tune the time/space trade off. The index consists of a tree
obtained follows. Each node corresponds to some query q and stores SU PU (q): multiple nodes may correspond
to the same query. A node corresponding to q has children corresponding to queries of the form q ∪ {p} for
p ∈ [m], p 6∈ q. The tree is constructed recursively as follows:
1. The root stores SU PU (∅) (which is U ).
2. A node x storing a set SU PU (q) has the following children. For each p(1 ≤ p ≤ m), x has a child storing
SU PU (q ∪ {p}) if p 6∈ q and the following condition is true:
|SU PU (q)| ≥ |SU PU (q ∪ {p})| +
n
c
The edge is labeled as p. If the node is not stored, then we say that it is “pruned”: none of its descendants
are stored in this case.
Figure 3 shows a hypothetical tree constructed according to the CIP algorithm. There are four attributes
(m = 4) numbered from 1 to 4. The numbers in the circle denote the size of the support of the corresponding
query. Dotted circles denote the nodes that were pruned.
Given a query q, its answer is computed by the following nondeterministic algorithm. The tree is traversed
starting from the root. If some node x storing SU PU (q ′ ) is reached, then we inspect whether x has at least one
outgoing edge labeled p s.t. p ∈ q. If so, one such edge is chosen nondeterministically and the search continues
with that child. Otherwise, SU PU (q ′ ) is returned: this may contain false positives, which are eliminated in one
linear scan.
100
1
60
2
82
3
2
4
45
85
3
4
40
25
3
55
4
25
30
3
4
4
4
23
15
35
20
Figure 4: Illustration for CIP2
Space Complexity: Each node stores at least nc fewer objects than its parent. Also, the root stores n objects.
Thus, the depth of the tree is bounded by c. Also, the maximum fanout is m. So there are O(mc ) nodes in the tree,
each storing at most n objects. Thus, the space taken by the algorithm is O(nmc ).
Time Complexity: Suppose the search ends at node x. For each p ∈ q, there can be at most nc objects stored
at x that do not contain p. So there can be at most mn
c objects at x that do not contain q and need to be filtered.
So the time taken by the algorithm is O(c + |SU PU (q)| + mn
c ). Notice that this upper bound holds for every
nondeterministic choice of the algorithm.
4.2
Optimizations to CIP
In its original formulation CIP cannot be used as a practical algorithm for the partial match problem. We introduce
here several optimizations and variations to make practical.
Order Optimization In CIP, some query results may be stored more than once. For instance, a query {1,
2} may be stored after expanding the edge 1 followed by 2 as well as expanding 2 followed by 1. In general, a
query with k elements can possibly occur k! times. The CIP algorithm can be modified to remove the storage of
duplicate queries. The following rule is added while expanding a node.
Rule 1. Expand the tree in a breadth-first fashion. Let x be the node to be expanded on behalf of query q. Let the
path from the root to x go through nodes x1 , x2 · · · xk and edges labeled p1 , p2 · · · pk (the edge pk is from xk to
x; notice that q = {p1 , . . . , pk }). Do not expand x if there exists i such that pk < pi and xi has an outgoing edge
labeled pk .
Figure 4 shows the tree in Figure 3 with the order optimization. To see an instance of Rule 1 consider the node
labeled 25 in Fig. 3 reachable by 2 then 1. Its ancestors are labeled 100 and 45. The node 100 has already an
outgoing edge labeled 1 (to 68), and 1 < 2, hence 25 is eliminated.
The search is modified as follows, and made deterministic: at each node x, the smallest p is traversed such that
query contains p and x expands p.
Lets call this algorithm as CIP2. We prove:
Theorem 4.1. In CIP2 algorithm, no query is stored twice.
Proof. On the contrary, let us assume that there is such a query q. Let xi and xj be the nodes that store q, and let
x be their least common ancestor. Let pi and pj be the edges from x leading to xi and xj respectively. We can
assume pi < pj w.l.o.g. By Rule 1, no descendant of x reached by pj can expand pi . So, it is impossible for xj to
store q, leading to a contradiction.
Theorem 4.2. The CIP2 algorithm has the same time bounds as CIP.
Proof. Let q be the given query. After following the search algorithm, suppose we reach a node x that cannot be
traversed further. We will show that there cannot be an edge p such that p ∈ q and p is not expanded because
of Rule 1. This will show that CIP2 and CIP have the same time complexity. So, on the contrary, let us assume
that there is such an edge p. This implies that some parent xi of x expands p. Let pi be the edge that leads to x.
By Rule 1, p < pi . So the search algorithm will take the edge p at x and never reach node x, which leads to a
contradiction. This proves the theorem.
Rule 1 is equivalent to the following simpler rule. Rule 2 is faster to enforce than Rule 1 as we do not need to
check all the parents of a node before expanding it.
Rule 2. Expand the tree in a breadth-firth fashion. Let x be the node to be expanded with incoming edge labeled
p. Let x′ be its parent, and let its incoming edge be labeled p′ . Do not expand x if p′ > p.
Rule (2) enforces a simple structure on the tree. Let p1 , p2 , . . . , pk be the labels on the edges on the path from
the root to some node x. Then p1 < p2 < . . . < pk .
The equivalence of the two rules follows easily from the following lemma.
Lemma 4.3. In algorithm CIP, it a node x doesn’t expand p, none of the ancestors of x expands p.
The lemma is a direct consequence of the fact that number of data items at x not containing p is less that the
number of data items not containing p at any ancestor of x.
Algorithm MULT In CIP2 the the running time for a query q is not guaranteed to be proportional to |SU PU (q)|.
The following change can provide this guarantee, and also motivates the construction of the PMI below: replace
the additive condition for expanding the node in the CIP algorithm by a multiplicative condition. For that we
introduce a new parameter ε > 0. A node x storing a set SU PU (q) has a child corresponding to query q ∪ {p} if
the following is true:
|SU PU (q)| ≥ |SU PU (q ∪ {p})| + ε|SU PU (q)|
Time Complexity: Let q be the query to be answered, and consider a node x (corresponding to some query q ′ ) that
does not expand any of the elements in q. There are |SU PU (q ′ )| objects stored at x. Therefore, given any p ∈ q,
there are at most ε|SU PU (q ′ )| objects stored at x that do not contain p. In all, there are at most εm|SU PU (q ′ )|
1
sets not containing q. So time taken by the algorithm is 1−mε
|SU PU (q)|.
log n
n
log 1/(1−ε) ).
Space Complexity: The maximum fanout is m and the depth is log log
1/(1−ε) . So the space required is O(nm
The Rule 1 can be applied to MULT as well to achieve the order optimization 6 .
4.3
Algorithm PMI
Finally, we can express our Partial Match Index PMI described in Sec. 3 in a way in which it can be analyzed and
compared with the variations on CIP described here.
Recall that PMI takes two parameters, S and ε. PMI also builds a tree, with each node storing SU PU (q) for
some query q. We apply Rule 2, hence the association from nodes to queries is one-to-one, and each path in the
tree has increasing edge labels. We prune nodes according to a multiplicative criteria. But unlike the previous
algorithms, a node can be pruned while some of its descendants are not pruned. We call a node alive if it is not
pruned.
Algorithm PMI The tree is constructed breadth-first:
6
Its simplified version, Rule 2, is not applicable for MULT
100
1
L1
60
2
L2
L3
82
3
4
40
25
3
4
2
45
85
3
55
4
25
30
3
4
4
4
23
15
35
20
Figure 5: Illustration for PMI
1. The root is associated to the query q = ∅ and stores SU PU (∅) (which is U ). It has m children, associated to
the queries {1}, . . . , {m} respectively; the edges entering these children are labeled 1, . . . , m respectively.
2. Let x be a non-root node, associated to some query q, and with incoming edge labeled p (p ∈ q). Then x has
children associated to q ∪ {p′ }, for p′ > p, p′ 6∈ q. (This enforces Rule 2.) The child associated to q ∪ {p′ }
has incoming edge labeled p′ .
3. If node x associated to query q has some live ancestor x′ associated to q ′ s.t. either (a) |SU PU (q ′ )| ≤ S
or (b) |SU PU (q ′ )| ≤ (1 + ε)|SU PU (q)|, then x is pruned and SU PU (q) is not stored. Notice that its
descendants are not necessarily pruned. If condition (a) holds for x then it also holds for all its descendants
and these are pruned too; if condition (b) holds then some of the descendants may be alive. If neither
conditions (a) or (b) hold then x is alive, and SU PU (q) is stored at node x.
Given a query q = {p1 , . . . , pk } its answer is computed by the following algorithm. Assume p1 < p2 < . . . <
pk , and let x be the node associated to the query q (x is reachable from the root by following the edges labeled
p1 , p2 , . . . , pk in this order). Starting from x, follow the path to the root until the first node x′ which is alive (x′
is x if x happens to be alive). Return SU PU (q ′ ), where q ′ is the query associated to x′ : this may contain false
positives, which are eliminated in a linear scan.
Figure 5 shows the same tree as 3 with algorithm PMI. In the actual data structure the pruned nodes need to be
eliminated, and their children promoted.
The connection to the PMI relational structure described in Sec. 3 is the following. Lk contains precisely the
queries q corresponding to the live nodes at depth k in the tree.
Time Complexity: It takes O(m) time to find the node x corresponding to the given query q. The set Answ =
SU PU (q ′ ) that is actually returned satisfies |Answ| ≤ max(S, (1 + ε)|SU PU (q)|). Hence the running time is
bounded by O(m + max(S, (1 + ε)|SU PU (q)|)).
Space Complexity: To analyze the space complexity, we proceed in two steps. First we express the space as
a function of the number of live-leaf nodes, then estimate the number of live-leaf nodes.
A live-leaf node is a node x which is alive, and which has no live descendants. If one collects all live nodes
in the PMI tree then the live-leaf nodes form the leaves of this subtree. Let the live-leaf node x be associated to a
query q; then:
|SU PU (q)| ≤ S
(3)
Moreover, let x0 be the parent of x in the PMI tree (x0 is not necessarily live), and q0 = q − {max(q)} be the
query associated to x0 . Then:
|SU PU (q0 )| > S/(1 + ε)
(4)
Indeed, suppose not. Then, in particular x0 is not alive (otherwise it would be a live-leaf node, contradicting the
fact that x is alive). Let x′ be the first live ancestor of x, with associated query q ′ . We have |SU PU (q ′ )| > S and
|SU PU (q0 )| ≤ S/(1 + ε), which implies that |SU PU (q ′ ) > (1 + ε)|SU PU (q0 )|, contradicting the fact that x0 is
not alive. Thus, we have shown that live-leaf nodes satisfy Equations (3) and (4).
We first prove:
Proposition 4.4. Let L be the total number of live-leaf nodes in a PMI tree. Then the total number of live nodes
is ≤ L(2 + log(n/S)
log(1+ε) ).
Proof. Equation (4) says that the parent of a life-leaf node x contains at least S/(1 + ε) data objects. Each live
ancestor of x will have at least a factor of (1 + ε) more data objects than the previous live ancestor. Hence the
total number of live ancestors of x, including itself, is ≤ 1 + log1+ε (n(1 + ε)/S) = 2 + log(n/S)
log(1+ε) . This proves the
proposition.
We now examine the number of live-leaf nodes, denoted L. Rather than giving a general upper bound, we
consider a probability distribution on the attributes in [m] = {1, 2, . . . , m}, and compute the expected value for L.
Let pr(p) denote the probability that the attribute p belongs to a data object. In other words, the expected Q
number of
data objects d that contain p (p ∈ d) is n × pr(p). Given a query q = {p1 , . . . , pk }, we denote pr(q) = i=1,k pi .
Assuming the probabilities of the attributes to be independent, the expected value of |SU PU (q)| is e(q) = n ×
pr(q). Given a query q and a number S ≤ n, we denote with pr(q, S) the probability that |SU PU (q)| > S. We
have:
X n
(pr(q))t (1 − pr(q))n−t
pr(q, S) = pr(|SU PU (q) > S) =
t
t=S+1,n
We need an upper bound for pr(q, S), and for that we use the following:
n
S
n
n−S
n
n−S
≤
=
t−S
t
−
S
S
t
t
S
which implies:
X n − S
n
n
t−S
n−t
S
pr(q) (1 − pr(q))
<
pr(q, S) ≤ (
pr(q)S < (e(q))S
pr(q) )
t−S
S
S
t=S+1,n
Recall that e(q) is the expected size of the answer of q.
Now we will compute the expected number of live-leaf nodes L in the PMI tree. Let x be a node associated
to the query q. The probability that it is a live-leaf node follows from conditions (3) and (4) which characterize
live-leaf nodes. Namely, this probability is pr(q, S)(1 − pr(q0 , S/(1 + ε))), where q0 = q − {min(q)}. Hence
the expected number of live-leaf nodes is bounded by:
X
L =
pr(q0 , S/(1 + ε))(1 − pr(q, S))
q⊆[m]
≤
X
pr(q0 , S/(1 + ε))
q⊆[m]
≤ m
X
(pr(q0 , S/(1 + ε)))
X
(e(q0 ))S/(1+ε)
q0 ⊆[m]
≤ m
q0 ⊆[m]
(5)
Comparision of PMI and CAP on Protein dataset
1.2e+08
Data size
CAP
epsilon=0.05
epsilon=0.25
epsilon=1.00
epsilon=2.00
epsilon=4.00
epsilon=20.0
8e+07
6e+07
Data size
CAP
epsilon=0.00
epsilon=0.05
epsilon=0.25
epsilon=1.00
epsilon=4.00
epsilon=20.0
9e+07
8e+07
7e+07
Index Size
Index Size
1e+08
Comparision of PMI and CAP on FMA dataset
1e+08
4e+07
6e+07
5e+07
4e+07
3e+07
2e+07
2e+07
1e+07
0
10000
100000
0
100
1000
S
10000
100000
1e+06
S
(a)
(b)
Figure 6: Index Sizes for CIP2 and PMI
So far we have not made any assumptions on the probability distributions of the attributes, pr(p). When the
distribution is uniform, i.e. there exists a constant c, 0 < c < 1, s.t. ∀p ∈ [m], pr(p) = c, then expected
value of |SU PU (q0 )| is e(q0 ) = nck , where k = |q0 |, and Equation (5) becomes mn(1 + cS/(1+ε) )n . For a
Zipf distributions, e(q0 ) = nck /k, for some constant c,Q0 < c < 1. For example, in the case of a simple Zipf
distribution where pr(p) = 1/p, we have e(q0 ) = n × p∈q0 1/p ≤ n × 1/k! ≈ n × 1/(k × ek/e ). In this case
Equation (5) becomes m(1 + cS/(1+ε) )n , i.e. a factor of n less than the uniform case. In summary, we have:
Proposition 4.5. The expected number of live-leaf nodes L is bounded by mn(1 + cS/(1+ε) )n , in the case of a
uniform probability distribution on the attributes, and by m(1 + cS/(1+ε) )n in the case of a Zipf distribution, where
c is some constant, 0 < c < 1.
Together, Propositions 4.4 and 4.5 give us the space complexity of the PMI.
5
Experiments
We evaluated algorithms CIP2 and PMI on two datasets. Experiments were ran on a Linux 930 MHz processor
with 256MB memory using C to construct the indexes and Microsoft SQL Server to execute queries on the indexes.
The first dataset was the Protein Sequence Database (http://pir.georgetown.edu/), which consists of
integrated collection of protein sequences. It has 75 distinct attributes. Not all data items have all the attributes
and some data items have multiple values for the same attribute. There are around 260,000 data items with a total
of 17.3 million attribute-value pairs and the total size of the database is 466 MB. The second database is called
FMA and consists of the metadata of MRI scans of brain. It is 166 MB in size with 1.4 million data items. Each
data item has exactly 7 attributes.
CIP2 uses a parameter c while PMI uses S. To compare them we chose c such as to give the same guarantees
as PMI. For Protein database, c was S/10 as PMI stored queries with a maximum size of 10. For FMA database, c
was put as S/7.
Figure 6 (a) shows the comparison of the sizes of indexes constructed using CIP2 and PMI on a log scale as a
function of the parameter S. For PMI we potted several curves for various values of the parameter ǫ. For a fixed
value of S, we observe that PMI performs better than CIP2 above a certain threshold value of ǫ beyond which
15
✂✁✁✂ ✂✂
✡✠✡✠✠
✡✡✠✠
✟✟✞✞✟✞✞
10
✡✡✠✠
✡✡✠✠
✟✟✞✞
✡✡✠✡✠✠
✝✆✝✆✆
✝✝✆✆
✝✝✆✆
✝✝✆✝✆
L1
✡✡✠✠
☞☞☛☛
✟✟✞✞
✟✟✞✟✞✞
✟✟✞✞
✟✟✞✟✞
L2
epsilon= 0.25
☞☞☛☛
✟✟✞✞
5
175
☞☞☛☛☞☛☛
✟✞✟✞
✡✡✠✡✠✠
✡✡✠✠
✡✠✡✠
✡✡✠
L3
(a)
☞☞☛☛
"timing.data"
180
epsilon= 4.00
epsilon= 1
Time (seconds)
Number of Tuples (millons)
Index Construction Times for Protein Database
185
170
165
160
155
150
145
☞☞☛☛
☞☞☛☞☛☛
✍✌✍✌✌
L4
L5
☞
✍✍✌
140
✄☎ ✎✏ ✑✒
L6
135
10000
L7
100000
S
(b)
Figure 7: Size of Lk (S = 16500) and time to construct the PMI (b). For Protein data.
CIP2 performs better. This threshold value decreases as S is increased. Figure 6 (b) shows the same graph for
FMI. Again, PMI performs better that CIP2 above a fixed threshold value of ǫ. But, unlike Protein database, this
threshold value does not increase as S increases. Both PMI and CIP2 algorithms seem to grow at the same rate.
Secondly, this threshold value is very low. Even for ǫ = 0.05, PMI outperforms CIP2 for all values of S.
Figure 7 (a) shows the sizes of Lk for the Protein database (it peaks at k = 3 and is zero after k = 7), while
(b) shows the time needed to construct the index, excluding the time needed to load it into the relational database.
PMI was implemented on Microsoft’s SQL server and we evaluated the efficiency of answering queries in
terms of logical disk I/Os. We compared this with the number of I/Os required to answer same queries using the
join-approach (Sec. 2.2). The results are shown in Figure 8 (a) and (b). Instead of reporting real time (which is
distorted by factors beyond our control, such as cache hits etc) we report the total number of logical disk I/Os
performed by the database, which is equal to the number of reads with cold cache. The comparison shows that
queries using the naive method are around 9000 times slower than those using the PMI index on the Protein
database.
Next we consider the effects of hard and soft cut-offs on the index size. Figure 9 (a) shows, on a log scale,
the sizes of index on Protein database as a function of S with ǫ = 0.05. We can see that soft cut-off results in a
saving of 15 to 35% while hard cut-off reduces space by 40 to 60%. Figure 9 (b) shows the same graph for FMA
database. For this database, with hard/soft cut in effect, the index size actually decreases as S decreases. This
is due to the fact that with a decrease in S, the hard/soft cut makes the index more specialized to answer smaller
queries. With Protein database, this phenomenon was not observed as it is a much larger database. With S being
made sufficiently small, it would also show the same behavior but it would be prohibitively expensive to construct
index on Protein database for such a small S.
To summarize, we have experimental shown that PMI can be highly efficient in answering partial match
queries, while consuming a only a practical amount of space.
6
Related Work
The earliest works on the partial match problem are the k-d-trees by Bentley[4], quadtrees by Bentley and
Finkel[10] and hashing and digital techniques by Rivest[24]. All of these algorithms use space partitioning techniques. They work well only when the number of attributes m is small compared to the number n of data items
100000
Disk I/Os for queries with PMI and Joins on FMA database
100000
10000
1000
With PMI
With Joins
10000
Number of I/O
Number of I/O
Disk I/Os for queries with PMI and Joins on Protein database
1e+07
With PMI
With Joins
1e+06
1000
100
10
100
10
1
10
100
1000
10000
Query Size
(a)
100000
1e+06
1
10
100
1000
10000
Query Size
100000
1e+06
(b)
Figure 8: Efficiency of answering queries measured in logical disk I/Os.
k
(m ≈ log n), and run in time O(n(1− d ) ), where k is the number of attributes specified in the query. In fact, Rivest
k
conjectured that any algorithm with linear space would require Ω(n(1− m ) ) time to answer such queries.
For a general n and m, and in particular, for high-dimensional data (m ≫ log n), there is no algorithm known
that is fast and has small space requirements. Therefore, it is believed [6] that this problem suffers from the curse of
dimensionality. Jayaram et al. [16] proved that in the cell probe model, any algorithm running in space polynomial
in m must make Ω(m) probes. However, these lower bounds do not seem strong enough to justify the curse of
dimensionality conjecture. Charikar et. al. [7] have recently given two algorithms that run in sub-linear time and
sub-exponential space for large m (presented here in Section 4).
A related problem in multidimensional data is the nearest neighbors problem [5]. However, the approximate
version of the nearest neighbor problem does not suffer from the curse of dimensionality [18]. No such analog is
known for the partial match problem.
There has been work [2, 25, 32] on creating indexes to efficiently answer second order queries, i.e. queries
which specify the values of two attributes. However, all of these techniques work in a naive way by considering all
possible second order queries, dividing them into groups and storing the result for each group in a bucket. These
techniques are applicable only when attributes take very few values and cannot be efficiently extended to higher
order queries.
It has been shown [31] that most of the data structures based on space partitioning perform worse than linear
scan for dimensions above 10. Several alternatives have been proposed to optimize the linear data scan like VAfiles [29], signature based techniques [30] and bitmaps. In spite of these optimizations, the fundamental problem
persists: the running time is proportional to the number of dataitems.
7
Conclusions
We have described a simple approach to searching heterogeneous data through partial match queries. We also
described an efficient index structure for supporting partial match queries on heterogeneous data. The index can
be implemented easily on any relational database engine, and provides guarantees on the number of physical disk
I/O’s performed to answer a query.
One interesting application of this index is to structured information sources on the Web. An issue for such an
index is how to crawl the available sources and how to relate their schemas. We leave this for future work.
8e+07
Comparision of various cuts on FMA dataset for epsilon=0.05
3.5e+07
Data size
No cut
3e+07
Soft cut
Hard cut
2.5e+07
Index Size
Index Size
Comparision of various cuts on Protein dataset for epsilon=0.05
1.2e+08
Data size
No cut
Soft cut
1e+08
Hard cut
6e+07
4e+07
2e+07
1.5e+07
1e+07
2e+07
5e+06
0
10000
100000
0
100
S
(a)
1000
10000
S
100000
1e+06
(b)
Figure 9: Hard and Soft Cut-offs
References
[1] Karl Aberer, Philippe Cudré-Mauroux, and Manfred Hauswirth. The chatty web: Emergent semantics through gossiping. In International World Wide Web Conference, Hungary, 2003.
[2] C. T. Abharam, S.P.Ghosh, and D.K.Ray-Chaudhuri. File organization schemes based on finite geometries. Inform.
Contr., 12(2):143–163, 1968.
[3] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A system for keyword-based search over relational databases. In
ICDE, pages 5–16, 2002.
[4] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. CACM, 18(9):509–517, 1975.
[5] Stefan Berchtold, Christian Böhm, and Hans-Peter Kriegel. The pyramid-technique: Towards breaking the curse of
dimensionality. In SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data,
June 2-4, 1998, Seattle, Washington, USA, pages 142–153. ACM Press, 1998.
[6] Allan Borodin, Rafail Ostrovsky, and Yuval Rabani. Lower bounds for high dimensional nearest neighbor search and
related problems. In Annual ACM Symposium on Theory of Computing (STOC), pages 312–321, 1999.
[7] M. Charikar, P. Indyk, and R. Panigrahy. New algorithms for subset query, partial match, orthogonal range searching,
and related problems. In 29th International Colloquium on Algorithms, Logic, and Programming, pages 451–462,
2002.
[8] B. Cooper, N. Sample, M. Franklin, G. Hjaltason, and M. Shadmon. A fast index for semistructured data. In VLDB,
2001.
[9] S. Dar, G. Entin, S. Geva, and E. Palmon. DTD’s DataSpot: database exploration using plain language. In VLDB, pages
645–649, 1998.
[10] Raphael A. Finkel and Jon Louis Bentley. Quad trees: A data structure for retrieval on composite keys. Acta Information, 4:1–9, 1974.
[11] L. Galanis, Y. Wang, S. Jeffrey, and D. DeWitt. Locating data sources in large distributed systems. In VLDB, pages
874–885, 2003.
[12] A. Halevy, Z. Ives, P. Mork, and I. Tatarinov. Piazza: Data management infrastructure for semantic web applications.
In Proceedings of the Intenrnational Conference on the World Wide Web, 2003.
[13] A. Halevy, Z. Ives, D. Suciu, and I. Tatarinov. Schema Mediation in Peer Data Management System. In ICDE, 2003.
[14] V. Hristidis and Y. Papakonstantinou. DISCOVER: keyword search in relational databases. In VLDB, 2002.
[15] Institute for Systems Biology. http://www.systemsbiology.org.
[16] T.S. Jayram, Subhash Khot, and Ravi Kumar Yuval Rabani. Cell-probe lower bounds for the partial match problem. In
Annual ACM Symposium on Theory of Computing (STOC), 2003.
[17] A. Kementsietsidis, M. Arenas, and R. J. Miller. Mapping Data in Peer-to-Peer Systems: Semantics and Algorithmic
Issues. In SIGMOD, pages 325–336, 2003.
[18] Jon M. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. In Twenty-ninth annual ACM
symposium on Theory of computing, pages 599–608, 1997.
[19] Stephen H. Koslow. Should the neuroscience community make a paradigm shift to sharing primary data? Nature
Neuroscience, 3(9):863–865, September 2000.
[20] Tanu Malik and Alex Szalay. Skyquery: A web service approach to federate databases. In Proceedings of CIDR, 2003.
[21] R. J. Miller, L. M. Haas, and M. Hernández. Schema Mapping as Query Discovery. In VLDB, pages 77–88, Cairo,
Egypt, September 2000.
[22] Nature Neuroscience Editorial. A debate over fmri data sharing. Nature Neuroscience, 3(9):845–846, Sep 2000.
[23] NIH statement on sharing research data. http://grants2.nih.gov/grants/policy/data sharing/index.htm, March 2002. U.S.
National Institutes of Health.
[24] Ronald L. Rivest. Partial-match retrieval algorithms. SIAM J. Comput., 5(1):19–50, 1976.
[25] S.P.Ghosh and C. T. Abharam. Application of finite geometries in file organization for records with multiple-valued
attributes. IBM J. Res. Develop., 12(2):180–187, 1968.
[26] Alexander S. Szalay, Jim Gray, Ani Thakar, Peter Z. Kunszt, Tanu Malik, Jordan Raddick, Christopher Stoughton, and
Jan vandenBerg. The sdss skyserver: public access to the sloan digital sky server data. In SIGMOD, pages 570–581,
2002.
[27] Jeffrey D. Ullman. Principles of Database and Knowledgebase Systems I. Computer Science Press, Rockville, MD
20850, 1989.
[28] H. Wang, S. Park, W. Fan, and P. Yu. ViST: a dynamic index method for querying XML data by tree structures. In
SIGMOD, pages 110–121, 2003.
[29] Roger Weber, Klemens Böhm, and Hans-J. Schek. Interactive-time similarity search for large image collections using
parallel VA-files. Lecture Notes in Computer Science, 1923:83–??, 2000.
[30] Roger Weber, Klemens Böhm, and Hans-J. Schek. Interactive-time similarity search for large image collections using
parallel VA-files. Lecture Notes in Computer Science, 1923:83–??, 2000.
[31] Roger Weber, Hans-Jörg Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search
methods in high-dimensional spaces. In Proc. 24th Int. Conf. Very Large Data Bases, VLDB, pages 194–205, 24–27
1998.
[32] Sumiyasu Yamamoto, Shinsei Tazawa, Kazuhiko Ushio, and Hideto Ikeda. Design of a balanced multiple-valued fileorganization scheme with the least redundancy. ACM Transactions on Database Systems (TODS), 4(4):518–530, 1979.