Structural XML Query Processing
Structural XML Query Processing
Structural XML Query Processing
RADIM BAČA and MICHAL KRÁTKÝ, Department of Computer Science, VŠB – Technical University
of Ostrava, Czech Republic
IRENA HOLUBOVÁ, MARTIN NEČASKÝ, TOMÁŠ SKOPAL, and MARTIN SVOBODA,
Faculty of Mathematics and Physics, Charles University, Czech Republic
SHERIF SAKR, King Saud bin Abdulaziz University for Health Sciences, Saudi Arabia University of New
South Wales, Sydney, Australia
Since the boom in new proposals on techniques for efficient querying of XML data is now over and the
research world has shifted its attention toward new types of data formats, we believe that it is crucial to
review what has been done in the area to help users choose an appropriate strategy and scientists exploit 64
the contributions in new areas of data processing. The aim of this work is to provide a comprehensive study
of the state-of-the-art of approaches for the structural querying of XML data. In particular, we start with
a description of labeling schemas to capture the structure of the data and the respective storage strategies.
Then we deal with the key part of every XML query processing: a twig query join, XML query algebras,
optimizations of query plans, and selectivity estimation of XML queries. To the best of our knowledge, this is
the first work that provides such a detailed description of XML query processing techniques that are related
to structural aspects and that contains information about their theoretical and practical features as well as
about their mutual compatibility and general usability.
CCS Concepts: • Information systems → Database query processing; XML query languages; Data
access methods;
Additional Key Words and Pharses: XML, structural XML query processing
ACM Reference format:
Radim Bača, Michal Krátký, Irena Holubová, Martin Nečaský, Tomáš Skopal, Martin Svoboda, and Sherif Sakr.
2017. Structural XML Query Processing. ACM Comput. Surv. 50, 5, Article 64 (September 2017), 41 pages.
https://doi.org/10.1145/3095798
1 INTRODUCTION
The adoption of the eXtensible Markup Language (XML) (Bray et al. 2008) proposed by the W3C1
as a standard for information exchange has gained great momentum and currently is undoubtedly
1 http://www.w3.org.
This work was partially supported by the Czech Science Foundation project Nr. GAČR 15-08916S, it was conducted within
the ENET project CZ.1.05/2.1.00/19.0389 and Student Grant Competition project with reg. no. SGS SP2017/157.
Authors’ address: R. Bača and M. Krátký, Department of Computer Science, Faculty of Electrical Engineering and Com-
puter Science, VŠB – Technical University of Ostrava, 17. listopadu 15, 708 33 Ostrava – Poruba, Czech Republic; emails:
{radim.baca, michal.kratky}@vsb.cz; I. Holubová (Mlýnková), M. Nečaský, T. Skopal, and M. Svoboda, Department of Soft-
ware Engineering, Faculty of Mathematics and Physics, Charles University, Malostranské nám. 25, 118 00 Praha 1, Czech
Republic; emails: {holubova, necasky, skopal, svoboda}@ksi.mff.cuni.cz; S. Sakr, Software Systems Research Group, Na-
tional ICT Australia (NICTA), ATP lab, Sydney, Australia; email: [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2017 ACM 0360-0300/2017/09-ART64 $15.00
https://doi.org/10.1145/3095798
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:2 R. Bača et al.
the main standard for the representation and exchange of data. Before this happened, we witnessed
a massive boom in techniques that enable efficient storing and querying of XML data. Later, there
appeared two competing strategies: (1) XML-enabled database systems (DBS), typically represented
by the introduction of XML extensions to relational database management systems (RDBMS). For
example, XML is currently used as a data type in most well-known RDBMSs (e.g., Oracle,2 IBM
DB2,3 Microsoft SQL Server4 ). (2) Native XML database systems (or XDBMSs), represented by DBSs
(e.g., MarkLogic,5 Sedna6 ) that have been mainly designed and built for managing XML data and
supporting XML-related technologies (e.g., XPath, XSLT, XQuery). While the opponents of the
former claim that storing hierarchical and semi-structured XML data in flat relations causes inef-
ficiencies in XML query processing, the opponents of the latter emphasize that building a native
XDBMS from “scratch”—including all the “side effects” such as multiuser access, transactions, log-
ging etc.is nontrivial and cannot beat the theoretical and implementational maturity of RDBMSs.
History has shown that the truth is somewhere in between. While efficient XML data querying
requires specific data structures and algorithms, universal RDBMSs (and their robust implementa-
tions) bring the advantage of reasonably efficient processing of any kind of data, including XML.
In practice, XML DBSs have been actively used in several scenarios and application domains.7
For example, modern content management systems (e.g., Drupal,8 Joomla!,9 WordPress10 ) are heavy
users of XML storing, querying, and publishing techniques at their backends. However, recently,
we have been witnessing that the boom in the proposals of new techniques for efficient querying
of XML data is now over and the research world has shifted its attention toward other kinds of
data models and data formats (e.g., JSON (Bray 2014), NoSQL (Sakr et al. 2011), RDF (Beckett and
McBride 2004), or linked data (Bizer et al. 2009)). Therefore, we believe that it is time to review
what has been done in the area. Such an overview, classification, and comparison of approaches
can help users choose an appropriate strategy for a particular application and help researchers
learn more about what has been done and can possibly be exploited in new areas of data process-
ing that are closely related to the XML data model. For example, JSON is commonly described
as a fat-free alternative to XML; it follows the tree structure of the XML data model. However, it
is simpler and more closely matches the data models of most programming languages. In prac-
tice, current NoSQL document DBSs (e.g., MongoDB,11 CouchDB,12 or Azure DocumentDB13 ) use
JSON documents as their main storage unit. XML is also one of the main serialization formats
for the RDF data model and linked data. Therefore, various JSON and RDF storing and querying
techniques (Sakr and Al-Naymat 2010; Faye et al. 2012) and other tree-structured data (e.g., file
systems, parse trees for linguistic data and the secondary structure of RNA) can effectively uti-
lize the query processing techniques of XML data. In addition, various research efforts have been
proposed to bring the XML and RDF worlds closer (Droop et al. 2008), for instance, by querying
XML data using the SPARQL language (Bikakis et al. 2009), the standard query language for the
2 https://docs.oracle.com/cd/E11882_01/appdev.112/e23094/xdb01int.htm.
3 http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.xml.doc/doc/c0022308.html.
4 https://msdn.microsoft.com/en-us/library/hh403385.aspx.
5 http://www.marklogic.com/.
6 http://www.sedna.org/.
7 https://www.ibm.com/developerworks/community/wikis/home?lang=en_us#!/wiki/pureXML/page/Success%20Stories.
8 https://www.drupal.org/.
9 https://www.joomla.org/.
10 https://wordpress.com/.
11 https://www.mongodb.com/.
12 http://couchdb.apache.org/.
13 https://azure.microsoft.com/en-us/services/documentdb/.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:3
RDF model,14 or by translating XPath queries into SPARQL queries (Droop et al. 2007). Other pro-
posals suggested using the XQuery language for querying JSON data (Robie et al. 2012; Bamford
et al. 2009) and NoSQL stores (Valer et al. 2013). According to Gartner (Feinberg and Ronthal 2016),
there is an increasing trend of a new generation of multimodel database systems (e.g., OrientDB,15
MarkLogic,16 or HPE Vertica17 ) that are designed to support storing data in a combination of related
models and query across them.
In this article, we provide a comprehensive survey of the state of the art of approaches and
related aspects for efficient structural XML data querying. In particular, we start with a description
of labeling schemes to capture the structure of the data and the respective storage strategies. Then
we deal with the key part of every query processing algorithm—a twig query join—as well as
optimizations of XML query processing. In particular, we summarize the main contributions of
this article as follows:
(1) We present a detailed description of up-to-date storage and indices for XML data, as well
as a classification of the physical access methods with regard to node labeling, document
partitioning, and twig query joins used.
(2) We provide a thorough description of the state-of-the-art twig query joins and their com-
parison in terms of their compatibility, features, and supported query models.
(3) We describe XML query algebras and outline their compatibility with twig query joins.
(4) We also discuss the main aspects of cost-based optimization techniques and selectivity
estimation approaches for XML queries. We compare these techniques in terms of sup-
ported query models, twig query join algorithms, and several other practical features as
well.
The rest of this article is structured as follows: In Section 2, we briefly describe the model of
XML data, current XML query languages, as well as labeling schemes that are commonly used in
indexing methods. Section 3 describes major storage and indices for XML data since they funda-
mentally influence the efficiency of query processing. In Section 4, we study the twig query join
problem, and, in Section 5, we deal with XML query algebras. In Section 6, we focus on optimiza-
tions and selectivity estimations of XML query processing. Finally, the conclusion of the article
and future work can be found in Section 7.
14 https://www.w3.org/TR/rdf-sparql-query/.
15 http://orientdb.com/orientdb/.
16 http://www.marklogic.com/.
17 http://www.vertica.com/.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:4 R. Bača et al.
Note that we ignore references (i.e., IDREF or IDREFS attributes) when considering the XML tree
model (Polyzotis et al. 2004a). In the following text, we use a term labeled path for a sequence of
node names of a path in an XML tree.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:5
other constructs. Thus, there exist different extensions of TPQ for handling various structural
aspects of XPath queries (Ning et al. 2008; Lu et al. 2005b). An example of the major TPQ extensions
is provided in Figure 2 (Q2–Q4). We use a notation for TPQ extension names, where a structural
construct not belonging to a basic TPQ definition is usually written as a superscript of a TPQ name
(i.e., Q2 and Q4).
Algorithms evaluating a TPQ on an XML document are called twig query joins or structural joins.
Any twig query join finds all the occurrences of a TPQ in an XML tree (also called a TPQ matching).
We discuss these in Section 4. Note that TPQ matching is often considered a core problem of XML
querying.
The XQuery language (Boag et al. 2010) extends XPath in many ways. Two important extensions
are the capability to specify output query nodes and an introduction of sequences in values. If we
use only the TPQ model, we would have to perform a TPQ output postprocessing in order to get
an expected XQuery output. To overcome this problem, a generalized query pattern (GTP) (Chen
et al. 2003) is introduced making such postprocessing unnecessary. Figure 3 shows an example of
three XQuery queries with their corresponding GTP models. We use a simplified notation of the
GTP model introduced in Chen et al. (2006), which exploits a circle to mark an output query node,
a square to mark an optional output query node, and a dashed edge to represent an optional rela-
tionship. The optional edge and the optional output node enable a simple representation of the let
clause or an optional XPath construct in the return clause (see Q6 and Q7 for an example). More-
over, values in query matches corresponding to the optional output nodes are (possibly empty)
sequences which reflects the XQuery data model. The simplified GTP model does not consider the
content constructs; it focuses only on the structural constructs of XQuery.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:6 R. Bača et al.
Fig. 4. (a) Containment labeling scheme. (b) Dewey order labeling scheme.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:7
— Insert: Depicts the difficulty of a node insertion operation. Insertion time is considered
“None” if there is no need to update labels after an insertion. The insertion time is consid-
ered “Low” if the following-sibling nodes with respect to the new node and their subtrees
need to be relabeled in the worst case. The insertion time is considered “Med” if we need to
relabel the whole tree in the worst case. (There is a gap technique that can accommodate
insertions in most cases while still preserving storage complexity.) The insertion time is
considered “High” otherwise.
— Storage (Stor.): Depicts the space complexity of each labeling scheme mainly with respect
to the number of nodes N .
Clearly, the prefix-based labeling schemes have fast insertions, but this is balanced out by the
cost of basic operations. Therefore, there is no holy grail of labeling schemes, and a tradeoff needs
to be made between the performance of basic operations or the performance of updates. When we
select an appropriate labeling scheme, we have to also consider LCA operation support because it
is required by various twig query joins (see Section 4.2.3).
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:8 R. Bača et al.
can be divided as follows: (1) Those based on a document structure (Zhang et al. 2001; Goldman
and Widom 1997; Chen et al. 2007; Kaushik et al. 2002a), and (2) those considering a typical XML
query workload (Kaushik et al. 2002b; Chen et al. 2003). Tag partitioning (Zhang et al. 2001; Al-
Khalifa et al. 2002) is the most common partitioning based on a document structure. Nodes are
divided according to their tags, and the number of partitions is equal to the number of unique
tags in an XML document. A tag+level (TL) (Chen et al. 2005) is a decomposition of the tag par-
titioning. Labeled path partitioning (Yoshikawa et al. 2001; Chen et al. 2005; Krátký et al. 2004) is
another common partitioning approach, sometimes also called prefix path streaming (PPS) (Chen
et al. 2005). Two nodes are equivalent in the labeled path partitioning if they have the same root-
to-node labeled path. The forward & backward partitioning (F&B) introduced by Kaushik et al.
(2002a) has the highest number of partition labels. Two nodes are equivalent in F&B if they have
the same root-to-node labeled path and all node-to-leaf labeled paths. Each of these partitionings
is actually a decomposition (refinement) of another one, as illustrated in Figure 5. On one hand, tag
partitioning produces a low number of partitions with a high number of nodes in each partition.
On the other hand, F&B provides the opposite property: A possibly high number of partitions with
a low number of nodes per each partition. We can also use a more refined partitioning approach
to speed up query processing. For example, labeled path partitioning allows us to resolve a simple
path expression without any AD relationships by a single random access into a node index (see
Section 3.1.3). Properties of different partitioning methods are discussed in detail in Section 3.1.4.
Another type of partitioning is semantic partitioning (Alghamdi et al. 2014) that partitions the XML
document according to the structure specified in an XML schema. Such an approach is similar to
labeled path partitioning in many cases; however, using semantic partitioning, we give a user the
possibility to influence the way we partition the data.
3.1.2 Schema Tree. A schema tree for an XML document is a labeled tree in which each labeled
path occurs only once. An example of a schema tree is depicted in Figure 15(b). In the literature,
there are many names for this tree: DataGuide (Goldman and Widom 1997), path tree (Aboulnaga
et al. 2001), summary tree, summary index, path index, and so on. A schema tree is useful for the
following purposes: (1) to determine partition labels corresponding to a query (Bača and Krátký
2008) when a more refined partitioning is used (e.g., tag+level, or labeled paths), (2) to support a
simple selectivity estimation (Aboulnaga et al. 2001), and (3) to get general knowledge about an
XML document structure (e.g., to support an auto-completion feature in an XML editor).
3.1.3 Node Indices. Node labels are not stored in a schema tree structure; instead, they are
stored in a node index as values. There are two basic types of node indices: (1) those having a node
id as a key (document indices), and (2) those having a partition label as a key (partition indices).
Note that the document index has node id as a key and node label as a value, which can be almost
the same information in some labeling schemes. However, some other information about the
node (i.e., pointer to the node content) can be stored in the value of the document index. Nodes
corresponding to one partition label are sorted according to the node label in the partition index;
therefore, a sequential scan of the key’s list is usually necessary during query processing. This
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:9
Fig. 6. (a) Document index. (b) Partition index. (c) Partition index with indexed lists.
can be improved by an index (XB-tree (Bruno et al. 2002) or XR-tree (Jiang et al. 2003a)) built over
each list. All types of node indices are schematically depicted in Figure 6.
From the query processing perspective, a document index is very useful when we have a small
set of context nodes and we want to use it to process the remaining relationships of a query.
This type of query processing can be considered navigational and is thoroughly described in Sec-
tion 4.1.2. On the other hand, many twig query joins are based on the partition index (see Section 4).
This type of join is mainly focused on the merge that occurs during one sequential scan of lists that
removes irrelevant nodes (Bruno et al. 2002; Al-Khalifa et al. 2002). “List” in the partition index is
called stream in the join, and we use the term sequence in the following text.
A specific category of node indexes address fast XML updates (Finis et al. 2016, 2013; Silberstein
et al. 2005). These works use an index with a structure similar to a document index. They use
physical node position instead of a node label, and they provide much better update performance;
the specific type of update operation (such as a tree relocation) has logarithmic complexity instead
of linear, as in the case of all labeling schemes mentioned in Section 2.3. On the other hand, they
do not offer the query performance provided by the partition index.
3.1.4 Balancing XML Storage. From the query processing point of view, the selection of parti-
tioning influences the size of three problems: (1) the overall amount of nodes that have to be read
from a node index, (2) the number of random accesses into a node index, and (3) the necessity to
find all the partition labels in a potentially large schema tree. If we minimize the first problem by
using more refined partitioning, then the latter two problems of query processing increase and
vice versa. This behavior also depends on query workload. For example, the F&B partitioning al-
lows us to process a TPQ without AD edges by a single random access to a node index. On the
other hand, the number of partition labels can be very close to the overall number of nodes in an
XML tree, and it can take a long time to find them. In such a case, the problem of TPQ matching
is shifted to the problem of finding all the partition labels in a schema tree, which is an identical
problem and so the same algorithms can be used (Goldman and Widom 1997).
As mentioned in Section 3.1.1, there are also partitioning techniques that are based on a typical
XML query workload (Kaushik et al. 2002b; Chen et al. 2003). These approaches take into account
typical queries, and the partitioning is created with respect to them. They are parameterized by a
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:10 R. Bača et al.
constant k (A(k )-index (Kaushik et al. 2002b), D (k )-index (Chen et al. 2003)), where k represents the
length of a labeled path that is considered during the partitioning. The number of partition labels
increases with increasing k. These approaches minimize all three query processing problems of
partitioning mentioned earlier; however, they are only effective for a specific workload.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:11
Sequence API
next(): node
finished(): bool
Table 2, and consist of two parts: (1) a navigational API and (2) a stream API. Each method in
these APIs returns a sequence of nodes which can be manipulated according to a sequence API
(see Table 3). Each method in the navigational API returns a sequence that corresponds to an
input partition label and satisfys an XPath axis with an input node. It should be noted that some
approaches also implement other XPath axes such as following or preceding in the navigational
API (Grust 2002). The stream API method returns a sequence corresponding to an input partition
label, and the sequence API contains methods to manipulate a sequence.
Operations in the APIs have significantly different performance characteristics depending on
available node indices (see Section 3.1.3). In general, the document index supports both the navi-
gational and stream APIs, whereas the partition index supports only the stream API. However, the
document index has to access nodes using a sequential scan of a B-tree containing many irrelevant
nodes when processing the stream API, and this is inefficient. A query processed in the partition
index accesses only nodes relevant to a partition label; therefore, it is crucial for the efficient pro-
cessing of all merge-like algorithms (see Sections 4.1.1 and 4.2). The selection of an appropriate
index and algorithm is a task for a cost-based optimizer (see Section 6.1).
Before we describe particular algorithms, we specify the general steps that can be identified as
parts of any twig query join as follows:
— Filtering: An algorithm scans sequences and filters out nodes not corresponding to any
query match. In this stage, algorithms use main memory filtering data structures to get rid
of the maximum number of nodes that are irrelevant to a query. The most common filtering
data structure is a stack or a set of stacks. The main feature of the filtering data structure
is that it can help to decide whether the node is useless or not in constant time. Of course,
every filtering can have false hits.
— Intermediate storage: Every join uses some type of intermediate storage where nodes are
stored before a query output is enumerated. In some approaches, a filtering data structure
is used as intermediate storage as well.
— Output enumeration: In this step, algorithms read the intermediate storage and generate
an output that is usually in a form of ordered tuples. The major task of this step is often
tuple ordering according to a query model.
One of the most significant differences among joins is the method used during the filtering step.
The first type of filtering methods focuses only on a pair of query nodes, and the twig query join
is processed as a set of binary structural joins. By contrast, holistic joins filter input nodes based on
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:13
Pros Cons
Binary structural joins Easily integrated into any Efficiency is significantly
XML query algebra and dependent on the selection of
support all XPath axes a good query plan. They can
produce a large intermediate
result when compared to the
query output.
Top-down holistic joins Linear I/O complexity of A sequential scan of an
query processing with intermediate result is
respect to the sum of output required even if it contains
and input sizes for some many useless nodes (see
query types and unnecessary Section 4.2.1).
query plan optimizations
Bottom-up holistic Linear CPU and I/O They can produce a large
joins complexities of the output intermediate result compared
enumeration with respect to to the query output (see
the output size Section 4.2.2).
information from all query nodes. Figure 7 summarizes the major categories of twig query joins,
and Table 4 depicts their main features.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:14 R. Bača et al.
Tree-Merge differ depending on the axis being resolved by the algorithm. In the case of the AD
relationship, the worst-case time and space complexities are linear with respect to the sum of the
input sequence and the output data sizes. In the case of the PC relationship, the time complexity
degrades. This is because Tree-Merge must work in the same way as for AD relationships, with
a difference being that each output pair is checked to determine whether it satisfies the PC rela-
tionship. Therefore, we can possibly check many irrelevant pairs in the case of a PC relationship.
The problem with PC edges is common to many twig query joins, as we will show in Section 4.2.
Moreover, Tree-Merge also has a problem with I/O complexity when two or more nodes in the
sequence have AD relationships. In that case, Tree-Merge must read certain parts of a sequence
repeatedly. Generally, Tree-Merge is rather inefficient when an input XML document contains re-
cursive data, even for AD relationships, since in such a case the algorithm performs repeated scans
of the sequences.
Al-Khalifa et al. also proposed another algorithm, called Stack-Tree, to solve the I/O complexity
problem of Tree-Merge. Stack-Tree exploits a stack that prevents repeated comparisons; therefore,
Stack-Tree reads both input sequences only once. The concept of stacks has fundamentally influ-
enced most later approaches in the area, and holistic algorithms are built on similar principles as
well. Stack-Tree maintains a stack containing only nodes lying on the same root-to-leaf path. In
other words, each node of the stack is a descendant of each node below it. This prevents repeated
reads of input sequences and ensures a linear worst-case I/O complexity with respect to the sum
of the input and output sizes in the case of PC and AD relationships.
4.1.2 Navigational Algorithms. Binary structural join are another type of navigational algo-
rithm (Grust et al. 2003; Halverson et al. 2003; Zhang et al. 2005). This type of query processing per-
forms a navigational operation for each node from the input sequence. The major representative of
navigational algorithms is the StairCase join (Grust et al. 2003). Let us consider an input sequence
(context nodes) corresponding to a query node. This sequence is usually an intermediate result
created by previous query processing. The StairCase join is based on XPath accelerator (Grust
2002), which is a simple algorithm that iteratively searches for successors of every context node
while avoiding duplicate scans of XPath accelerator. Although the complexity boundaries of the
StairCase join have not been proved, this algorithm was shown to be fast on a real implementation,
MonetDB/XQuery (MonetDB BV 2008).
— Query match of Q in D is a tuple of nodes from D where: (1) each query node q ∈ Q has
exactly one corresponding node in the tuple, (2) each pair of nodes satisfies the XPath rela-
tionship between the corresponding query nodes.
— Weak query match of Q in D is defined in the same way as the query match of Q, but with
a difference being that a pair of nodes can only have an AD relationship even though the
corresponding query nodes have a PC relationship.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:15
Every weak query match is a relaxed version of the query match. If algorithm filtering detects
a weak query match with a node, it does not mean that a query match also exists. However, it is
much easier and faster to look only for weak query matches during the filtering step.
Holistic algorithms apply filtering to decide whether a node belongs to a (weak) query match
or not. We define the following types of nodes:
— Useless node nq : It is confirmed that no query match exists of Q containing nq .
— Weakly matched node nq : The existence of a weak query match of q with nq is confirmed.
— Matched node nq : The existence of a query match of a subquery rooted by q containing nq
is confirmed.
— Fully matched node nq : The existence of a query match of Q with nq is confirmed.
Note that any matched node can still be useless. We define the last type of a query match com-
monly evaluated during the filtering step of a join. Let Qp (q) be a set of query nodes on the path
from the root query node to a query node q. A prefix query match of q in an XML document D is
simply a query match of Qp (q) .
We recognize two main types of filtering approaches: (1) top-down filtering (Bruno et al. 2002;
Lu et al. 2004, 2005a), which skips most of the irrelevant nodes before they are stored on stacks,
and (2) bottom-up filtering (Chen et al. 2006; Qin et al. 2007), which skips most of the irrelevant
nodes when they are popped out from their stacks. Top-down filtering has a significant feature:
For some types of queries, it stores only fully matched nodes in a filtering data structure and inter-
mediate storage. We consider that a holistic algorithm is optimal in such a case. The major features
of an optimal holistic algorithm are linear worst-case time and I/O complexities with respect to the
sum of the input and output data sizes. Holistic algorithms utilizing top-down filtering are thor-
oughly described in Section 4.2.1. Bottom-up filtering optimizes another important aspect of the
holistic algorithm: output enumeration. We describe holistic algorithms using bottom-up filtering
in Section 4.2.2.
4.2.1 Top-Down Approaches. As mentioned in Section 4.2, TwigStack and PathStack represent
the first holistic approaches. They use one stack for each query node as a filtering data structure,
similarly to Stack-Tree. Nodes stored in the same stack all have AD relationships. Before we push
a node on a stack, we pop out all nodes that are not its ancestors. This behavior leads to linear
space complexity of filtering with respect to the depth of an XML document, which is typical for
all top-down approaches.
In general, PathStack is a very simple algorithm performing a prefix match filtering that filters
out every node nq not having any corresponding prefix query match of q. In other words, Path-
Stack considers only a query path to q, and, therefore, it works very efficiently for simple path
expressions (i.e., queries without predicates). On the other hand, PathStack can perform poorly
on twig queries. This is a strong field for TwigStack, because TwigStack considers the whole TPQ
during filtering. Before TwigStack checks the prefix match of q, it performs a weak match filtering
that filters out all nodes that are not weakly matched. The weak match filtering is represented by
a filtering function called getNext. Bruno et al. prove that TwigStack is optimal if TPQ contains
only AD relationships. In other words, for such a query, getNext returns only matched nodes, and
only fully matched nodes are stored on a stack. The getNext filtering function further introduces
an unnecessary node skipping that allows it to possibly skip many nodes in a sequence at the same
time. This feature can be used effectively when sequences are indexed (see Section 4.3).
Another important aspect of each holistic algorithm is output enumeration. Many top-down al-
gorithms use the output enumeration proposed by TwigStack. TwigStack filtering generates root-
to-leaf path query matches called solutions for each root-to-leaf query path in a TPQ. We need to
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:16 R. Bača et al.
Fig. 8. Example of TwigStack’s iterations performed during the query processing of Q8 and Q9 on the XML
tree.
sort the solutions in order to support their efficient output enumeration. An intermediate stor-
age can contain many duplicate nodes, and its size can be even bigger than the size of an input
XML tree in the worst case. Moreover, the enumeration sequentially reads arrays, which results
in unnecessary overhead when many irrelevant nodes are stored in intermediate storage. These
problems are addressed by the bottom-up algorithms discussed in Section 4.2.2.
The basic principles of TwigStack are depicted in Figure 8. Each iteration of TwigStack calls the
filtering function that returns a query node q, and TwigStack stores a head node of the sequence
Tq on the stack Sq if a prefix match exists (note that the head node of Tq has to be weakly matched
due to the filtering function call). Solutions are produced every time we push a node corresponding
to the leaf query node (i.e., 4th , 5th , and 7th iterations) on a stack. Actions performed during each
iteration are the same whether we process Q8 or Q9 because weak match filtering considers every
PC relationship as AD. As a result, the solutions produced in the example are irrelevant for Q9,
which is resolved during enumeration.
The TwigStack algorithm is optimal for a query containing only AD relationships. The weak
match filtering of the algorithm can push useless nodes onto a stack when a query contains a
PC relationship, as demonstrated in the example. As noted in Choi et al. (2003), the optimality
of weak match filtering can be improved in two ways: (1) by multiple scans of a sequence and
(2) by relaxing the data storage model (e.g., using a more refined XML document partitioning, as
described in Section 3.1.1). Choi et al. showed that the cost of the filtering that would be optimal for
any TPQ is too high. Therefore, algorithms improving the optimality of weak match filtering do
that only for a subset of twig queries. TwigStackList (Lu et al. 2004) represents the first instance of
optimality improvement (i.e., it allows multiple scans of nodes in a sequence), and iTwigJoin+PPS
and iTwigJoin+TL (Chen et al. 2005) represent the second instance (i.e., they relax the data storage
model). The iTwigJoin algorithm is an elegant combination of the holistic join and refined XML
document partitioning introduced in Section 3.1.1.
Figure 9 contains the holistic algorithms for improving TwigStack optimality, and they introduce
other TPQ types that are optimally processed. Arrows in Figure 10 represent techniques utilized
to enhance the optimality of each algorithm. The TJFast (Lu et al. 2005a) and TreeMatch (Lu et al.
2010) algorithms are described in Section 4.2.3. In general, the optimality of a holistic algorithm
does not have to be related only to a query type. Bača and Krátký (2009) and Bača et al. (2012) in-
troduced optimality based on XML document features as well. This approach shows that a holistic
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:17
Fig. 9. TPQ types and algorithms that are optimal for their processing.
algorithm using weak match filtering can be optimal under tag+level or labeled path partitioning if
an XML document does not contain any recursive nodes. They also show that, for many common
XML collections, TwigStack is optimal if labeled path partitioning is used.
4.2.2 Bottom-Up Approaches. As mentioned earlier, the optimality of a holistic algorithm is not
always the main criterion for its efficiency. Another important aspect of a holistic algorithm is fast
output enumeration. All top-down algorithms use TwigStack output enumeration, which is not
very efficient (as mentioned in Section 4.2.1). Another problem of TwigStack output enumeration
is that it returns the result in the form of query matches; they do not reflect information about
output query nodes, and they need additional post-processing to get the output expected by an
XQuery.
These problems are addressed by the Twig2 Stack algorithm (Chen et al. 2006), which ensures a
linear output enumeration with respect to the result size and works with the GTP query model.
Twig2 Stack uses a filtering data structure called a hierarchical stack (HS) and subtree match filtering
that filters out every node that is not matched. That is true regardless of query type. The HS also
serves as an intermediate storage, and, therefore, nodes are pushed onto the HS in postorder. A
mechanism of a node push and the structure of an HS is demonstrated in Figure 11(a). The example
contains two situations: before the push of node a 1 and after the push of node a 1 . The enumeration
algorithm starts with an HS corresponding to the root query node, and only nodes relevant to the
result are accessed during output enumeration.
Compared to the top-down approaches, Twig2 Stack has the following two disadvantages: (1) its
subtree match filtering accesses nodes in an HS that are descendants of the currently evaluated
node, and (2) before the node is pushed onto an HS, the algorithm does not check its prefix query
match (i.e., prefix match filtering is missing). The first disadvantage means that the space complex-
ity of the filtering is linear with respect to the XML tree size, and Twig2 Stack can spend additional
time merging many nodes during the push onto an HS. The second disadvantage is that it pushes
many useless nodes onto HSs. For example, all nodes corresponding to leaf query nodes are pushed.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:18 R. Bača et al.
Fig. 11. (a) Example of Twig2 Stack’s hierarchical stacks created during the query processing of the TPQ on
the XML tree. (b) TwigList’s lists.
Therefore, Twig2 Stack proposed an improvement where the PathStack’s prefix match filtering is
applied before a node is pushed onto an HS. However, many useless nodes are still pushed onto
an HS (see PathStack in Section 4.2.1).
The first disadvantage of Twig2 Stack is improved by the TwigList algorithm (Qin et al. 2007) by
using an approach where an HS is replaced by a simple array. TwigList needs only the last node
inserted into each array during the subtree match filtering; another advantage is that the array
can be easily paged on a secondary storage if needed. Nodes in the array are stored in postorder,
which means that the result has to be subsequently postprocessed when we require a result pre-
sented according to document order. When a node nq is appended to the end of the array, it is
also associated with a pointer to its first and last descendants corresponding to each child of q.
We show the arrays of TwigList in Figure 11(b). The arrays are much simpler than HSs, which
simplifies their maintenance and allows persistence. One disadvantage of TwigList’s intermediate
storage is the suboptimal handling of PC edges, leading to the fact that TwigList stores even more
irrelevant nodes in intermediate storage than Twig2 Stack. However, this behavior is simply cor-
rected in Grimsmo et al. (2010) by the introduction of level split vectors that represent an improved
TwigList’s intermediate storage.
4.2.3 Combined Approaches. The main disadvantage of the bottom-up approaches, when com-
pared to top-down approaches, is represented by insufficient filtering before a node is inserted into
intermediate storage. A possible solution is a combination of top-down filtering and bottom-up
output enumeration. Basically, Twig2 Stack+PathStack, introduced in Chen et al. (2006), represents
such a solution; however, PathStack uses only prefix path filtering, and the number of irrelevant
nodes in intermediate storage can still be very large. Another combined approach is represented
by the TJStrictPost and TJStrictPre algorithms (Grimsmo et al. 2010).
TJStrictPost uses one global stack for postorder node processing and one stack per query node
for prefix query match filtering. Therefore, each node is stored twice in filtering data struc-
tures. TJStrictPost applies prefix match filtering before a node is inserted into the filtering data
structures. Therefore, the number of nodes inserted into intermediate storage is equal to that of
Twig2 Stack+PathStack. Whether a node is useless or matched is decided when it is popped out
from the global stack, and the matched node is then inserted into intermediate storage. As a result,
nodes are stored in level split vectors in postorder, which means that output enumeration produces
an output that has to be sorted.
TJStrictPre uses a combination of weak match filtering and level split vectors. It introduces an
improved filtering function called getPart. It remembers the last weakly matched node for each
query node returned by the previous getPart call, which allows faster useless node skipping when
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:19
Compatibility
Algorithm Type Filtering Query model category
TwigStack Weak match + Prefix TPQ A
iTwigJoin Weak match + Prefix TPQ A
XTwigStack Top-down Weak match + Prefix X-TPQ A
TwigStackList Weak match + Prefix TPQ B
TJFast Prefix + Weak match TPQ∗ C
Twig2 Stack Subtree match GTP D
Twig2 Stack + Prefix + Subtree match GTP D
PathStack Bottom-up
TwigList Subtree match TPQ D
TJStrictPost Prefix + Subtree match TPQ D
TJStrictPre Weak Match + Prefix TPQ A, D
TreeMatch Combined Prefix + Weak match TPQ >,∗ C
GTPStack Weak match + Prefix + GTP A, D
Subtree match
compared to TwigStack’s getNext filtering function. TJStrictPre stores nodes in its intermediate
storage as soon as they are pushed onto the filtering data structure (a stack of query nodes). There-
fore, nodes in level split vectors are stored in preorder, and the output after output enumeration
does not have to be sorted. However, two passes through level split vectors are needed.
The TreeMatch algorithm (Lu et al. 2010) is an extension of the TJFast algorithm (Lu et al.
2005a) for a larger set of query constructs and with optimizations related to output query nodes.
TreeMatch and TJFast represent a family of algorithms that require a labeling scheme supporting
the LCA operation mentioned in Section 2.3. TreeMatch includes support for TPQ∗, > and semantics
related to boolean expressions. Moreover, TreeMatch applies subtree match filtering. TreeMatch
stores nodes on the ancestor’s stacks until it is certain that the nodes correspond to an output.
This approach avoids the solution sorting used by TJFast and enables TreeMatch to be optimal for
queries having only ADs in nonoutput branching edges.
GTPStack (Bača et al. 2012) is another approach combining top-down and bottom-up filtering.
It is capable of processing a GTP with linear worst-case I/O complexity with respect to the GTP
result size. It further improves the getPart filtering function by using dynamic programming to
avoid unnecessary calls, which reduces the number of holistic algorithm iterations. The number
of iterations performed in Figure 8 would be reduced to six since the first two iterations can be
performed in one getPart call. GTPStack uses level split vectors as intermediate storage for fast
output enumeration.
4.2.4 Comparison of Holistic Approaches. In Table 5, we compare the existing holistic ap-
proaches based on the following criteria:
— Filtering: This criterion shows filtering applied before a node is inserted into intermedi-
ate storage. We consider the worst-case scenario of the algorithm’s filtering. For example,
most top-down algorithms utilize weak match filtering in Table 5; however, when a top-
down algorithm is optimal for a query, then it performs match filtering (i.e., it filters out all
nodes that are irrelevant to a query). A comparison of filterings according to their selectiv-
ity is provided in Figure 12. We use the term “selectivity” to define how the method filters
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:20 R. Bača et al.
irrelevant nodes (high selectivity means that the method is more successful). The most com-
mon variant of filtering is a combination of weak match filtering (the getNext/getPart
filtering function) and prefix match filtering, while another common filtering is a subtree
match. The weak query match and prefix query match filtering combination is typical for
top-down approaches, and the subtree match filtering is typical of bottom-up approaches.
Note that a more selective filtering does not usually mean faster query processing due to
generated overhead. The major advantage of a more selective filtering is a lower number
of nodes in intermediate storage (often significantly lower). From this perspective, GTP-
Stack achieves the lowest number of nodes in intermediate storage due to its combination
of filterings.
— Query model: This feature specifies the query model supported.
— Compatibility: We assign a compatibility category symbol (A–D) to each holistic approach.
Two algorithms with the same compatibility category symbol can be combined into one
algorithm with minimal effort. Each algorithm is in one of the four categories in Table 5
except for TJStrictPre and GTPStack, which are combinations of algorithms from the A and
D categories; therefore, they can exploit the advantages of both.
It is not easy to compare holistic algorithms according to their real processing time using a
single criterion because each algorithm has its own difficulties. The major difficulty of the top-
down algorithms is their enumeration algorithm. Conversely, the problem of the bottom-up and
combined approaches is that they can store an enormous number of useless nodes in intermediate
storage. From this point of view, combined approaches seem to be closest to an optimal holistic
algorithm from the query processing time point of view.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:21
works (Jiang et al. 2003b; Fontoura et al. 2005) describe the optimization of useless node skipping
in a holistic join. The proposed algorithms look for a possibly longest jump that can significantly
decrease the number of I/O accesses as well as node label comparisons.
LCA operation. Section 2.3 depicts the possibility of processing the LCA operation in some la-
beling schemes. TJFast (Lu et al. 2005a), virtual cursor (Yang et al. 2004), TJDewey (Bača and Krátký
2009), and TreeMatch (Lu et al. 2010) utilize the LCA operation to avoid reading sequences corre-
sponding to non-leaf query nodes. Such an approach can avoid reading a lot of nodes; however,
node labels in such a labeling scheme can occupy quite a lot of space by themselves. Therefore,
a node label compression technique should be applied (Härder et al. 2007; Bača et al. 2010). The
advantages of the LCA operation can also be integrated into other algorithms. However, when
compared to an integration of partitioning, LCA operation support needs some changes in an al-
gorithm, and it is not often straightforward. Moreover, the assumption for the use of the LCA
operation and the avoidance of reading a non-leaf query nodes’ sequence is that the labeled path
for each node is accessible during an algorithm run. In practice, it can be difficult to satisfy this
assumption.
Boolean expressions. A number of works deal with TPQs containing OR and NOT con-
structs (Jiao et al. 2005; Jiang et al. 2004; Yu et al. 2006; Che et al. 2012; Izadi et al. 2012; Bača
et al. 2012; Lu et al. 2010). These works avoid a split of a TPQ into more queries. Generally, to sup-
port boolean expressions, we need to normalize a TPQ (Che et al. 2012) and include this support in
node filtering. Once a TPQ is normalized, it is easy to include boolean expressions into a filtering
function (Che et al. 2012) and also into subtree match filtering as well (Bača et al. 2012; Lu et al.
2010).
Data parallelization. There are works dealing with the parallelization of twig query join al-
gorithms (Pan et al. 2008; Machdi et al. 2009, 2010; Tian et al. 2013; Shnaiderman and Shmueli
2015). These works use so-called data partitioning, which means that they partition the sequences,
and each thread works with its partition. Note that this is a different type of partitioning than the
document partitioning depicted in Section 3.1.1. It is easy to integrate data partitioning into any
twig query join, and Shnaiderman and Shmueli (2015) shows that the speed-up can be surprisingly
more than linear with respect to the number of threads because of the ability to skip unnecessary
parts of sequences during partitioning.
5 XML ALGEBRAS
In Section 4, we introduced several algorithms for TPQ processing (i.e., twig query joins). However,
it might not be clear how the TPQ model can be incorporated into XML algebra. In this section,
we describe various XML algebras and their compliance with the TPQ model.
Relational algebra is one of the main components of RDBMSs and plays an important role in
their success and widespread usage. A corresponding XML algebra has the same importance and
plays a substantial role in XML query processing. Many proposals for XML algebras have been
introduced (Jagadish et al. 2001; Chen et al. 2003; Grust et al. 2004; Deutsch et al. 2004; Brantner
et al. 2005; Re et al. 2006; Özcan et al. 2008). In general, existing XML algebras fall into two main
classes depending on the data model they use:
— Tree-based algebra: This class of XML algebras represents XML data as forests of labeled
ordered trees (Jagadish et al. 2001; Al-Khalifa and Jagadish 2002; Chen et al. 2003) and pro-
vides a natural support for the twig query joins introduced in Section 4 and XML-specific
optimizations.
— Tuple-based algebra: This class of XML algebras operates similarly to relational algebra
on sets of tuples (Brantner et al. 2005; Deutsch et al. 2004; Grust et al. 2004; Re et al. 2006;
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:22 R. Bača et al.
Özcan et al. 2008) and extends it to meet the requirements of the XQuery data model. There
are two types of tuple-based algebras, depending on the tuple value allowed: (1) a value
can only be a null, an atomic value, or an XML node (Grust et al. 2004) (see Section 5.2.1);
and (2) a value can also be a sequence of values or a sequence/bag of tuple values (see
Section 5.2.2).
We summarize the basic categories of XML algebras in Figure 13 and describe them in more
detail in the following sections.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:23
that allow the query optimizer to employ relational algebraic optimization techniques such as
a pushdown of the selection operator. Figure 14 lists the Pathfinder algebraic operators that are
designed to receive one or more inputs and produce one or more outputs. These inputs and outputs
are sets of tuples. Most of the operators perform quite simple and standard relational operations
(e.g., selection, projection, cartesian product, join, union) that allow the query optimizer to employ
the usual relational algebraic optimization techniques (Sakr 2007). For example, the XPath Location
Step Evaluator operator applies the evaluation conditions of an XPath location step over the context
nodes and returns the result as a sequence of tuples corresponding to output nodes.
5.2.2 Nested Algebras. As mentioned earlier, there are some works (Brantner et al. 2005; Re
et al. 2006; Manolescu et al. 2009) that allow nonatomic values. As a result, such an algebra can-
not be integrated into a relational query processor as easily as the basic tuple-based algebras. The
major reason for these extensions of the basic algebras is the possibility of building more efficient
query plans. Brantner et al. (2005) presented NAL algebra (Natix Algebra) for an algebraic com-
pilation of an XPath expression supporting a sequence of tuple values as a value. NAL algebra is
developed in the context of the Natix project (Fiebig et al. 2002), and it introduces support for all
XPath constructs. It can easily incorporate only navigational binary structural joins mentioned in
Section 4.1.2. To overcome this problem, Mathis (2007) introduced an extension of NAL in order to
support the TPQ, and, therefore, all the algorithms described in Section 4. Re et al. (2006) present
an approach that allows a value to be a sequence of values. These sequences are called XML val-
ues, and they correspond to sequences of items in the XQuery data model. The algebra contains
three basic types of operators: (1) operators on XML values, (2) operators on tuples, and (3) oper-
ators allowing XML values to be mapped to tuples and the other way around. One of the newer
works (Manolescu et al. 2009) describes a unified algebra (UA) that is similar to that described in
another work (Re et al. 2006). UA is a combination of operators specific for XML values (i.e., a
sequences of XML items) and relational algebra. UA allows homogeneous sequences of tuples or
bags of tuples in a value.
All XML algebras mentioned in this section claim that it is easier to identify and apply promising
optimization techniques such as group by on complicated and nested XQueries. Another advantage
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:24 R. Bača et al.
is that we can incorporate all TPQ algorithms introduced in Section 4; we only need to identify a
TPQ in a query plan (Michiels et al. 2007).
— Language: This criterion specifies whether the algebra supports XPath or XQuery. More-
over, there are algebras supporting XQuery restricted only to the AD and PC relationships
(called AD-XQuery).
— Twig query joins: We analyze what kind of twig query joins the algebra supports. We use
the abbreviations BJ and Nav BJ for a binary join and a navigational binary join, respectively
(see Sestion 4.1). The “all” shortcut means that both twig query join types (i.e. binary and
holistic) are supported.
— Relational compliance: We analyze whether an algebra uses a data model fully compatible
with the relational data model or not. In other words, this criteria is just to determine if it is
possible to map the XML algebra operators on the relational operators without substantial
work.
— Major advantages: We highlight the major advantages of each approach.
As shown in Table 6, UA is the only algebra supporting XQuery and all the twig query joins.
However, if we apply a TPQ identification in a query plan (Michiels et al. 2007), then holistic twig
joins can be used in Galax algebra as well.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:25
Algorithm type
Merge-like Navigational
Binary BM (StackTree (Al-Khalifa et al. 2002)) B N (StairCase (Grust et al. 2003))
joins
Holistic H M (TwigStack (Bruno et al. 2002)) H M N (CostTwigJoin (Bača et al. 2015))
joins H M N (CostTwigJoin (Bača et al. 2015))
6.1.1 Logical Level Optimizations. XML algebras usually describe a set of possible rewritings
and promising query plan transformations (Re et al. 2006; May et al. 2004). The most significant
type of rewriting is unnesting. This is a well-known in RDBMSs that typically rewrites a depen-
dent nested query into an independent one which can subsequently lead to a significant query
processing speed-up. Wu et al. (2003) introduced dynamic programming algorithms searching for
an optimal binary structural join order. This approach is based on the fact that we can process
a TPQ though many equivalent query plans consisting of binary structural joins and sort opera-
tors. Che et al. (2006) introduced a set of algebraic XPath transformations for query optimizations
that are based on an extended PAT algebra (Salminen and Tompa 1994) originally developed for
search on semi-structured text. The algebraic transformations are based on an XML schema or
XML document properties and on the existence of the so-called structural index. The term “struc-
tural index” is used by Che et al. (2006) to describe a materialized join between a pair of XML
elements in an XML schema. However, structural index selection is not mentioned in the same
study even though it can be crucial to the technique’s applicability.
6.1.2 Physical Level Optimizations. Cost-based optimizations at the physical level focus mainly
on a selection between navigational and merge-like algorithms (see Section 4.1). However, a twig
query join (structural binary or holistic) selection is also important. Table 7 summarizes the basic
categories of twig query joins from the cost-based perspective, and for each category it gives one
typical algorithm.
Halverson, Burger, Galanis, Kini, Krishnamurthy, Rao, Tian, Viglas, Wang, Naughton, et al.
(2003), Zhang et al. (2005), and Georgiadis et al. (2010) introduced XML systems that are capable
of using both types of binary structural joins (B M and B N in Table 7) during the query processing.
Each XML system builds a query plan consisting of a set of binary structural joins, where an
appropriate type of join is selected according to the selectivity. All these works focus on XPath
processing, and they are based on a simple observation that a navigational algorithm is superior to
a merge-like algorithm if the percentage of useful nodes from an input sequence is low. This cost-
based optimization is similar to the selection of an appropriate equi-join algorithm in RDBMSs,
where we select between the merge join and the nested-loop join. Whereas the work by Halverson
et al. (2003) is one of the first in this field, Zhang et al. (2005) uses a cost-based model based on a
statistical regression. Georgiadis et al. (2010) introduced a data structure API similar to that defined
in Table 2. The API allows both types of indices to be abstracted (i.e., document and partition
indices depicted in Section 3.1) to a set of simple operations, and they evaluate several various
index configurations and physical operators.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:26 R. Bača et al.
Categories of twig
Approach join algorithms Major advantages
(Wu et al. 2003) BM Resolves optimal binary structural
join ordering
(Halverson et al. 2003) BM BN Baseline for other approaches
combining B M and B N
(Zhang et al. 2005) BM BN Adaptive to a query workload
(Georgiadis et al. 2010) BM BN Contains abstraction of basic
operations which makes the solution
general
(Weiner and Härder 2010) BM HM Integrates holistic joins into query
plans
(Bača et al. 2015) HM N Integrates navigational query
processing into a holistic join
Weiner and Härder (2009, 2010) considered the advantages of a combination of binary and
holistic joins. In Weiner and Härder (2009), we can find one of few direct comparisons between
binary and holistic joins. The authors search for a query selectivity threshold when a holistic
join becomes more efficient than a plan with binary structural joins. Their results indicate that
highly selective queries are performed more efficiently using holistic joins. Weiner and Härder
(2010) described the approach for using this knowledge to build a query plan combining binary
and holistic structural joins. Bača et al. (2015) incorporated the navigational query processing into
a holistic join. In this way, the search space of possible plans is reduced, and so an optimal query
plan selection can be simplified when compared to approaches based on binary joins.
Since there are many various XML indices, the authors of Elghandour et al. (2013) presented a
design advisor for recommending the indices and materialized views.
6.1.3 Comparison of Cost-Based Optimizations. Table 8 lists the major cost-based approaches
and shows the categories of join algorithms used by these approaches. In general, many approaches
are based on the idea that navigational algorithms are faster than merge-like algorithms if the
selectivity is high (Halverson et al. 2003; Zhang et al. 2005; Georgiadis et al. 2010). Other ap-
proaches (Weiner and Härder 2010; Bača et al. 2015) take into account the cost of holistic joins
in their cost-based optimization methods.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:27
problem (Polyzotis et al. 2004a). Several research efforts proposed various selectivity estimation
techniques in the XML domain (Aboulnaga et al. 2001; Polyzotis and Garofalakis 2006b; Zhang
et al. 2006; Fisher and Maneth 2007; Polyzotis and Garofalakis 2006a; Teubner et al. 2008). These
techniques can be broadly classified into four main classes in terms of a structure used for
collecting the summary information:
— Synopsis-based estimation techniques: This class uses a tree or graph to represent the
summary information of an XML document (Aboulnaga et al. 2001; Lim et al. 2002; Polyzotis
and Garofalakis 2002, 2006b; Zhang et al. 2006; Polyzotis and Garofalakis 2006a; Fisher and
Maneth 2007).
— Histogram-based estimation techniques: This class uses statistical histograms to cap-
ture the summary information of an XML document (Freire et al. 2002; Wu et al. 2002; Li
et al. 2006).
— Sample-based estimation techniques: These approaches derive a sample from an XML
document, and this sample is subsequently used for the selectivity estimation (Luo et al.
2009).
— Algebra-based estimation techniques: This class uses the existing machinery from the
relational domain (Sakr 2008; Teubner et al. 2008).
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:28 R. Bača et al.
Fig. 15. (a) XML tree. (b) Schema tree with corresponding node counts. (c) XSeed kernel. (d) Binary repre-
sentation of XML document and corresponding SLT grammar. (e) Mapping tables of Li et al. (2006).
estimation (i.e., they provide a low estimation error) when compared to most other techniques.
They first create a compact graph statistical structure called a kernel from an XML document, sub-
sequently updating this kernel according to a query workload. TreeSketch (Polyzotis et al. 2004b)
is the earliest technique of the group. It provides not only a selectivity estimation but also an ap-
proximate result of an TPQ. The XSeed technique (Zhang et al. 2006) addresses mainly recursive
XML documents and recursive queries. Both techniques handle only TPQs without value predi-
cates. The main advantages of both techniques is the small size of a statistical structure and a low
estimation error. On the other hand, they both report a long build time of the TreeBank collection.
For example, Luo et al. (2009) reported on the inability to create the estimation structure for Tree-
Bank within days for both techniques (i.e., TreeSketch and XSeed). The XSeed statistical structure
is constructed by starting with a small kernel having a size and shape similar to the schema tree
of an XML document; this is then incrementally updated through the feedback of queries. The
kernel as such is represented in the form of a label-split graph summary structure proposed by
Polyzotis and Garofalakis (2002). In this graph, each edge e = (u, v) is labeled with a vector of
integer pairs (p0 :c 0 , p1 :c 1 , . . . , pn :c n ), where the ith integer pair (pi :c i ) indicates that, at recur-
sion level i, there are pi and c i elements mapped to the synopsis vertices u and v, respectively.
Figure 15(c) illustrates an example of the XSeed kernel for the XML tree in Figure 15(a).
In Fisher and Maneth (2007), a Straight-Line Tree (SLT) grammar statistical structure is proposed.
This work uses the well-known idea of removing repeated patterns in a tree by removing multiple
occurrences of equal (or similar) subtrees and replacing them with pointers to a single occurrence
of the subtree. The synopsis is constructed by using a tree compression algorithm to generate a
minimum unique Directed Acyclic Graph (DAG) of a binary XML tree (see example in Figure 15(d)),
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:29
and then represents resulting DAG structures using a lossy SLT grammar. Figure 15(d) illustrates
an example of the binary XML tree corresponding to the XML tree in Figure 15(a) and the cor-
responding SLT grammar. An estimation algorithm is designed for a tree automaton to run over
the lossy SLT grammar. XML queries are converted into its equivalent tree automaton and evalu-
ated. The proposed synopsis of this work supports the estimation of all XPath axes, including the
order-sensitive ones, using a memory space efficiently; however, it can deal only with simple path
expressions.
XCluster (Polyzotis and Garofalakis 2006b) is considered a generalized form of the XSketch tree
synopsis, which is a previous work of Polyzotis and Garofalakis (2002). However, the XCluster
synopsis captures correlations also between a structure and values. The approach uses histograms
to approximate the value distribution. The build time of the XCluster synopsis is not included in
the paper; however, since XSketch has a build time of within a few days for TreeBank (Li et al.
2006), we can expect similar properties for XCluster as well.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:30 R. Bača et al.
while having comparable statistical structure size and significantly lower build time. Nevertheless,
the storage requirements for the path id table grow slightly faster than linearly with a number of
root-to-leaf paths, which is not feasible for less regular XML documents. In the case of TreeBank,
the path id table occupies megabytes of storage space since it has several tens of thousands of
root-to-leaf paths, whereas the statistical structures usually only occupy a few kilobytes.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:31
experiments mention the build time (the experimental build time) of 10MB DBLP or XMark
collections within seconds, or if the build method has clearly linear complexity with respect
to XML document size. We consider the build time “medium” (“med.”) if the experimental
build time of 10MB DBLP or XMark collections is within minutes, if the build method has
clearly polynomial complexity, or if accurate results need iterative learning for accurate
predictions. We consider the build time “high” if the experimental build time of 10MB DBLP
or XMark collections is within hours or if the TreeBank collection cannot be built within a
day. We always consider the worst build time. Note that in the case of Wu et al. (2002) there
is not enough data to specify its build time.
In Table 9, we did not include algebra-based techniques (Sakr 2008; Teubner et al. 2008) since
these high-level selectivity estimation techniques can support any TPQ selectivity estimation tech-
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:32 R. Bača et al.
nique. We also did not consider the accuracy of any selectivity estimation since the comparison of
accuracy is even more fuzzy than the comparison of build time. In general, techniques that have
been constantly reporting a low estimation error (Li et al. 2006; Luo et al. 2009) are kernel-based
techniques such as TreeSketch (Polyzotis et al. 2004b) or XSeed (Zhang et al. 2006). Kernel-based
techniques try to avoid join independence assumptions in their synopsis as much as possible,
which leads to high accuracy. On the other hand, their build time is long and it dramatically in-
creases with irregularity of an XML document structure.
In principle, not many techniques capture a correlation between the structure and the node
content. For example, the accuracy of the sampling-based technique (Luo et al. 2009) is highly
dependent on the size of a sample and the actual value distribution. In StatiX (Freire et al. 2002),
relational value histograms are used to make the selectivity estimation of XML queries with values
more accurate. On the other hand, the necessity of an XML schema can be a limitation. XClus-
ter (Polyzotis and Garofalakis 2006b) is a combination of a highly accurate kernel approach with
value histograms; however, the build time can be a limitation for some XML collections.
As can be observed, there is no single technique that is appropriate for all types of XML col-
lections and XML queries. Therefore, we believe that an XDBMS should include at least two tech-
niques that complement each other for an accurate selectivity estimation, with one of the tech-
niques using a kernel (Polyzotis and Garofalakis 2002, 2006a; Polyzotis et al. 2004b; Zhang et al.
2006; Polyzotis and Garofalakis 2006b) since it provides high accuracy and an acceptable build time
for regular XML documents, and another technique using the sampling-based approach (Luo et al.
2009) since it has a low build time that is useful for irregular XML documents, and it supports
XQuery. If the necessity of an XML schema is not a limitation, StatiX (Freire et al. 2002) is also a
good choice that does not need to be complemented by any other technique.
— XML algebras: All existing query algebras have the same goal: to formalize query processing
into a set of elementary operators that will subsequently simplify a query plan cost-based
optimization. We distinguish that only unified algebra supports XQuery and all twig query
joins (binary as well as holistic joins).
— Twig query joins: We have shown that at some points of processing it is more efficient to use
a navigational algorithm instead of a merge-like algorithm when binary structural joins are
used. We also showed that holistic joins overcome the problem of possibly large interme-
diate results produced by binary structural joins. We abstracted the features of twig query
joins from the data structures, and we described their advantages mainly with respect to
the ability to efficiently filter input sequences. Although there are many holistic join algo-
rithms with various features, combined approaches seem to be closest to an optimal holistic
algorithm from the filtering capability point of view. Moreover, we also showed a list of
techniques that can be easily integrated into most of the twig query joins.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:33
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:34 R. Bača et al.
the previous one (Wu et al. 2003)), the performance of query processing for these plans will
have very similar characteristics to the plans of holistic joins, particularly when the query
has a low number of output query nodes.
— Utilization of semantic information of querying constructs: Exploitation of the semantics of
the output query nodes (Lu et al. 2010) and theoretical properties of the join operations (Bača
et al. 2012) in the query model can significantly improve the performance of the query plans.
We expect that similar improvement can be achieved by leveraging the semantics of some
other constructs (e.g., count function or position specification) in the optimization of a query
plan.
— Emerging database platforms: There is a lack of analysis of the performance of XML indexing
mechanisms and join algorithms with respect to emerging database platforms, such as main
memory database systems, highly parallel environments, or platforms that are designed to
combine various types of workloads (Leis et al. 2013; Schuh et al. 2016). Consideration these
emerging platforms in the context of structural XML query processing can generate original
and highly competitive solutions.
— Adoption of structural XML querying techniques for querying emerging data exchange for-
mats: Recently, we have witnessed the emergence of various data exchange formats such
as JSON and RDF. These formats have similar structure to XML. Many ideas for structural
XML can be utilized for efficient querying of JSON and RDF data. For example, containment
labeling in graph vertex reachability queries has been utilized in Trißl and Leser (2007). We
believe that other techniques of structural XML query processing can be useful as well.
— Querying across multiple data formats: Semi-structured data formats (e.g., XML and JSON)
are currently widely utilized in the area of multimodel databases (Lu and Holubová 2017)
in combination with other data formats (e.g., graphs) (Sakr and Pardede 2011). For example,
ArangoDB18 combines key/value pairs, JSON documents, and graph data. Or, MarkLogic19
supports a combination of distinct data models, such as XML, JSON, or RDF. Efficient evalu-
ation of cross-model queries is critical, and exploitation/extension of the described methods
is a promising strategy.
In general, XML querying involves a wide spectrum of aspects. In this article, we focused on
structural XML query processing. Other survey papers and research efforts have considered other
important aspects, such as the keyword search on XML data (Liu and Chen 2011; Zhou et al. 2016;
Agarwal et al. 2016), skyline queries (Cohen and Shiloach 2009), approximate queries (Polyzo-
tis et al. 2004b; Liu and Yan 2016), XML automata (Schwentick 2007), and querying compressed
XML formats (Maneth et al. 2008; Sakr 2009). In addition, with the widespread adoption of the Big
Data phenomena (Sakr and Gaber 2014; Sakr 2016) and the continuous increase of web size and
XML data volumes (Lehner and Sattler 2013), several distributed and parallel XML query process-
ing techniques have been proposed to tackle this challenge (Abiteboul et al. 2006; Cavendish and
Candan 2008; Abiteboul et al. 2008; Figueiredo et al. 2010; Camacho-Rodríguez et al. 2012; Bidoit
et al. 2013; Camacho-Rodríguez et al. 2015; Hammoud et al. 2015). We expect to see increasing
research activities in this direction.
REFERENCES
Serge Abiteboul, Ioana Manolescu, Neoklis Polyzotis, Nicoleta Preda, and Chong Sun. 2008. XML processing in DHT net-
works. In IEEE 24th International Conference on Data Engineering. IEEE, 606–615.
18 https://www.arangodb.com/.
19 http://www.marklogic.com/.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:35
Serge Abiteboul, Ioana Manolescu, and Emanuel Taropa. 2006. A framework for distributed XML data management. In
International Conference on Extending Database Technology. Springer, 1049–1058.
Ashraf Aboulnaga, Alaa R. Alameldeen, and Jeffrey F. Naughton. 2001. Estimating the selectivity of XML path expressions
for internet scale applications. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB).
591–600.
Manoj K. Agarwal, Krithi Ramamritham, and Prashant Agarwal. 2016. Generic keyword search over XML data. In Proceed-
ings of the 19th International Conference on Extending Database Technology (EDBT’16). 149–160. DOI:http://dx.doi.org/
10.5441/002/edbt.2016.16
Shurug Al-Khalifa and H. V. Jagadish. 2002. Multi-level operator combination in XML query processing. In Proceedings of
the 2002 ACM CIKM International Conference on Information and Knowledge Management (CIKM). 134–141.
Shurug Al-Khalifa, H. V. Jagadish, Jignesh M. Patel, Yuqing Wu, Nick Koudas, and Divesh Srivastava. 2002. Structural
joins: A primitive for efficient XML query pattern matching. In Proceedings of the 18th International Conference on Data
Engineering (ICDE). 141–152.
Norah Saleh Alghamdi, Wenny Rahayu, and Eric Pardede. 2014. Semantic-based structural and content indexing for
the efficient retrieval of queries over large XML data repositories. Future Generation Computer Systems 37 (2014),
212–231.
Muath Alrammal, Gaetan Hains, and Mohamed Zergaoui. 2011. Path tree: Document synopsis for xpath query selectiv-
ity estimation. In 2011 International Conference on Complex, Intelligent and Software Intensive Systems (CISIS). IEEE,
321–328.
S. Amer-Yahia. 2003. Storage Techniques and Mapping Schemas for XML. Technical Report TD-5P4L7B, AT&T Labs-Research.
Morton M. Astrahan and Donald D. Chamberlin. 1975. Implementation of a structured english query language. Communi-
cations of the ACM 18, 10 (1975), 580–588.
R. Bača and M. Krátký. 2009. On the efficiency of a prefix path holistic algorithm. In Database and XML Technologies (Lecture
Notes in Computer Science), Vol. 5679. Springer–Verlag, 25–32.
Radim Bača, Michal Krátký, Tok Wang Ling, and Jiaheng Lu. 2012. Optimal and efficient generalized twig pattern process-
ing: A combination of preorder and postorder filterings. The VLDB Journal (2012), 1–25.
Radim Bača, Petr Lukáš, and Michal Krátký. 2015. Cost-based holistic twig joins. Information Systems 52 (2015), 21–33.
A. Balmin and Y. Papakonstantinou. 2005. Storing and querying XML data using denormalized relational databases. The
VLDB Journal 14, 1 (2005), 30–49.
Roger Bamford, Vinayak Borkar, Matthias Brantner, Peter M. Fischer, Daniela Florescu, David Graf, Donald Kossmann,
Tim Kraska, Dan Muresan, Sorin Nasoi, and others. 2009. XQuery reloaded. Proceedings of the VLDB Endowment 2, 2
(2009), 1342–1353.
Zhifeng Bao, Tok Wang Ling, Bo Chen, and Jiaheng Lu. 2009. Effective xml keyword search with relevance oriented ranking.
In Proceedings of the 2009 IEEE 25th International Conference on Data Engineering. IEEE, 517–528.
Radim Bača and Michal Krátký. 2008. On the efficient search of an XML twig query in large dataguide trees. In Proceedings
of the 12th International Database Engineering & Applications Symposium, IDEAS 2008. ACM Press.
Radim Bača and Michal Krátký. 2009. TJDewey – On the efficient path labeling scheme holistic approach. In Proceedings of
the 1st International Workshop on Benchmarking of XML and Semantic Web Applications, DASFAA. Springer–Verlag.
Radim Bača, Jiří Walder, Martin Pawlas, and Michal Krátký. 2010. Benchmarking the compression of XML node streams.
In Proceedings of the BenchmarX 2010 International Workshop, DASFAA. Springer-Verlag.
Rudolf Bayer and E. McCreight. 2002. Software pioneers. Springer-Verlag New York. Chapter Organization and Mainte-
nance of Large Ordered Indexes, 245–262.
Dave Beckett and Brian McBride. 2004. RDF/XML Syntax Specification (Revised). W3C. Retrieved from http://www.w3.org/
TR/rdf-syntax-grammar/.
Nicole Bidoit, Dario Colazzo, Noor Malla, Federico Ulliana, Maurizio Nolé, and Carlo Sartiani. 2013. Processing XML
queries and updates on map/reduce clusters. In Joint 2013 EDBT/ICDT Conferences, EDBT’13 Proceedings. Genoa, 745–
748. DOI:http://dx.doi.org/10.1145/2452376.2452470
Nikos Bikakis, Nektarios Gioldasis, Chrisa Tsinaraki, and Stavros Christodoulakis. 2009. Querying XML data with SPARQL.
In International Conference on Database and Expert Systems Applications. Springer, 372–381.
P. V. Biron and A. Malhotra. October 2004. XML Schema Part 2: Datatypes (Second Edition). W3C. Retrieved from
http://www.w3.org/TR/xmlschema-2/.
Christian Bizer, Tom Heath, and Tim Berners-Lee. 2009. Linked data – The story so far. International Journal on Semantic
Web and Information Systems 5, 3 (2009), 1–22.
S. Boag, D. Chamberlin, M. F. Fern’andez, D. Florescu, J. Robie, and J. Sim’eon. December 2010. XQuery 1.0: An XML Query
Language (Second Edition). W3C. Retrieved from http://www.w3.org/TR/xquery/.
Philip Bohannon, Juliana Freire, Prasan Roy, and Jérôme Siméon. 2002. From XML schema to relations: A cost-based ap-
proach to XML storage. In Proceedings of the 18th International Conference on Data Engineering (ICDE). 64–75.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:36 R. Bača et al.
Matthias Brantner, Sven Helmer, Carl-Christian Kanne, and Guido Moerkotte. 2005. Full-fledged algebraic Xpath processing
in Natix. In Proceedings of the 21st International Conference on Data Engineering (ICDE). 705–716.
Tim Bray. 2014. The Javascript object notation (Json) data interchange format. Internet Engineering Task Force (IETF) (2014).
T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau. November 2008. Extensible Markup Language (XML)
1.0 (Fifth Edition). W3C. http://www.w3.org/TR/xml/.
Nicolas Bruno, Nick Koudas, and Divesh Srivastava. 2002. Holistic twig joins: Optimal XML pattern matching. In Proceedings
of the ACM SIGMOD International Conference on Management of Data. 310–321.
Jesús Camacho-Rodríguez, Dario Colazzo, and Ioana Manolescu. 2012. Building large XML stores in the Amazon cloud. In
2012 IEEE 28th International Conference on Data Engineering Workshops (ICDEW). IEEE, 151–158.
Jesús Camacho-Rodríguez, Dario Colazzo, and Ioana Manolescu. 2015. Paxquery: Efficient parallel processing of complex
xquery. IEEE Transactions on Knowledge and Data Engineering 27, 7 (2015), 1977–1991.
Dirceu Cavendish and K. Selçuk Candan. 2008. Distributed XML processing: Theory and applications. Journal of Parallel
and Distributed Computing 68, 8 (2008), 1054–1069.
Surajit Chaudhuri. 1998. An overview of query optimization in relational systems. In Proceedings of the 17th ACM SIGACT-
SIGMOD-SIGART Symposium on Principles of Database Systems. ACM, 34–43.
Dunren Che, Karl Aberer, and M. Tamer Özsu. 2006. Query optimization in XML structured-document databases. The VLDB
Journal 15, 3 (2006), 263–289.
Dunren Che, Tok Wang Ling, and Wen-Chi Hou. 2012. Holistic boolean-twig pattern matching for efficient XML query
processing. IEEE Transactions on Knowledge and Data Engineering 24, 11 (2012), 2008–2024.
Bo Chen, Tok Wang Ling, M. Tamer Özsu, and Zhenzhou Zhu. 2007. On label stream partition for efficient holistic twig
join. In Database Systems for Advanced Applications, 10th International Conference (DASFAA). 807–818.
Qun Chen, Andrew Lim, and Kian Win Ong. 2003. D(k)-index: An adaptive structural summary for graph-structured data.
In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD’03). ACM, New York,
134–144.
Songting Chen, Hua-Gang Li, Jun’ichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal, and K. Selçuk Candan. 2006.
Twig2 stack: Bottom-up processing of generalized-tree-pattern queries over XML documents. In Proceedings of the 32nd
International Conference on Very Large Data Bases (VLDB). VLDB endowment, 283–294.
Ting Chen, Jiaheng Lu, and Tok Wang Ling. 2005. On boosting holism in XML twig pattern matching using structural in-
dexing techniques. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (SIGMOD).
455–466.
Zhimin Chen, H. V. Jagadish, Laks V. S. Lakshmanan, and Stelios Paparizos. 2003. From tree patterns to generalized tree
patterns: On efficient evaluation of XQuery. In Proceedings of the 29th International Conference on Very Large Data Bases
(VLDB). 237–248.
B. Choi, M. Mahoui, and D. Wood. 2003. On the optimality of holistic algorithms for twig queries. In Database and Expert
Systems Applications (Lecture Notes in Computer Science), Vol. 2736. Springer–Verlag, 28–37.
J. Clark and S. DeRose. November 1999. XML Path Language (XPath) Version 1.0. W3C. Retrieved from http://www.w3.org/
TR/xpath.
Sara Cohen and Maayan Shiloach. 2009. Flexible XML querying using skyline semantics. In 2009 IEEE 25th International
Conference on Data Engineering. IEEE, 553–564.
Wojciech Czerwinski, Wim Martens, Pawel Parys, and Marcin Przybylko. 2015. The (almost) complete guide to tree pat-
tern containment. In Proceedings of the 34th ACM Symposium on Principles of Database Systems, PODS 2015, Melbourne,
Victoria, Australia, May 31-June 4, 2015. 117–130. DOI:http://dx.doi.org/10.1145/2745754.2745766
Alin Deutsch, Yannis Papakonstantinou, and Yu Xu. 2004. The NEXT logical framework for XQuery. In Proceedings of the
30th International Conference on Very Large Data Bases (VLDB). 168–179.
Paul F. Dietz. 1982. Maintaining order in a linked list. In Proceedings of 14th Annual ACM Symposium on Theory of Computing
(STOC 1982). 122–127.
Matthias Droop, Markus Flarer, Jinghua Groppe, Sven Groppe, Volker Linnemann, Jakob Pinggera, Florian Santner, Michael
Schier, Felix Schöpf, Hannes Staffler, and others. 2008. Bringing the XML and semantic web worlds closer: Transform-
ing XML into RDF and embedding XPath into SPARQL. In International Conference on Enterprise Information Systems.
Springer, 31–45.
Matthias Droop, Markus Flarer, Jinghua Groppe, Sven Groppe, Volker Linnemann, Jakob Pinggera, Florian Santner, Michael
Schier, Felix Schöpf, Hannes Staffler, and others. 2007. Translating xpath queries into sparql queries. In OTM Confeder-
ated International Conferences on the Move to Meaningful Internet Systems. Springer, 9–10.
F. Du, S. Amer-Yahia, and J. Freire. 2004. ShreX: Managing XML documents in relational databases. In VLDB’04: Proceedings
of 30th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc., Toronto, ON, Canada,
1297–1300.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:37
E.-S. M. El-Alfy, S. Mohammed, and A. F. Barradah. 2015. XHQE: A hybrid system for scalable selectivity estimation of
XML queries. Information Systems Frontiers (2015), 1–17.
Iman Elghandour, Ashraf Aboulnaga, Daniel C. Zilio, and Calisto Zuzarte. 2013. Recommending XML physical designs for
XML databases. VLDB Journal 22, 4 (2013), 447–470. DOI:http://dx.doi.org/10.1007/s00778-012-0298-2
David C. Faye, Olivier Curé, and Guillaume Blin. 2012. A survey of RDF storage approaches. Arima Journal 15 (2012), 11–35.
Donald Feinberg and Adam M. Ronthal. 2016. Hype Cycle for Information Infrastructure, 2016. Technical Report G00304182.
Gartner.
Thorsten Fiebig, Sven Helmer, Carl-Christian Kanne, Guido Moerkotte, Julia Neumann, Robert Schiele, and Till Westmann.
2002. Anatomy of a native XML base management system. VLDB Journal 11, 4 (2002), 292–314.
Guilherme Figueiredo, Vanessa Braganholo, and Marta Mattoso. 2010. Processing queries over distributed XML databases.
Journal of Information and Data Management 1, 3 (2010), 455.
Jan Finis, Robert Brunel, Alfons Kemper, Thomas Neumann, Franz Färber, and Norman May. 2013. DeltaNI: An efficient
labeling scheme for versioned hierarchical data. In Proceedings of the 2013 ACM SIGMOD International Conference on
Management of Data. ACM, 905–916.
Jan Finis, Robert Brunel, Alfons Kemper, Thomas Neumann, Norman May, and Franz Faerber. 2016. Order indexes:
Supporting highly dynamic hierarchical data in relational main-memory database systems. VLDB Journal (2016),
1–26.
Damien K. Fisher and Sebastian Maneth. 2007. Structural selectivity estimation for XML documents. In Proceedings of the
23rd International Conference on Data Engineering (ICDE). 626–635.
D. Florescu and D. Kossmann. 1999. Storing and querying XML data using an RDMBS. IEEE Data Engineering Bulletin 22,
3 (1999), 27–34.
Marcus Fontoura, Vanja Josifovski, Eugene Shekita, and Beverly Yang. 2005. Optimizing cursor movement in holistic twig
joins. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM’05).
ACM, New York, 784–791.
Juliana Freire, Jayant R. Haritsa, Maya Ramanath, Prasan Roy, and Jérôme Siméon. 2002. StatiX: Making XML count. In
Proceedings of the ACM SIGMOD International Conference on Management of Data. 181–191.
Haris Georgiadis, Minas Charalambides, and Vasilis Vassalos. 2010. Efficient physical operators for cost-based XPath exe-
cution. In Proceedings of the 13th International Conference on Extending Database Technology. ACM, 171–182.
Roy Goldman and Jennifer Widom. 1997. DataGuides: Enabling query formulation and optimization in semistructured
databases. In Proceedings of the International Conference on Very Large Data Bases, VLDB 1997. 436–445.
Nils Grimsmo, Truls A. Bjorklund, and Magnus Lie Hetland. 2010. Fast optimal twig joins. In Proceedings of the 36st Inter-
national Conference on Very Large Data Bases, VLDB 2010. VLDB Endowment.
Torsten Grust. 2005. Purely relational FLWORs. In Proceedings of the 2nd International Workshop on XQuery Implementation,
Experience and Perspectives (XIME-P), in cooperation with ACM SIGMOD.
Torsten Grust. June 4-6, 2002. Accelerating XPath location steps. In Proceedings of the 2002 ACM SIGMOD, Madison, USA.
ACM Press.
Torsten Grust, Jan Rittinger, and Jens Teubner. 2008. Pathfinder: XQuery off the relational shelf. IEEE Data Engineering
Bulletin 31, 4 (2008), 7–14.
Torsten Grust, Sherif Sakr, and Jens Teubner. 2004. XQuery on SQL hosts. In Proceedings of the 29th International Conference
on Very Large Data Bases (VLDB). 252–263.
Torsten Grust, Maurice van Keulen, and Jens Teubner. 2003. Staircase join: Teach a relational DBMS to watch its (axis)
steps. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB). 524–525.
Alan Halverson, Josef Burger, Leonidas Galanis, Ameet Kini, Rajasekar Krishnamurthy, Ajith Nagaraja Rao, Feng Tian,
Stratis D. Viglas, Yuan Wang, Jeffrey F. Naughton, and others. 2003. Mixed mode XML query processing. In Proceedings
of the 29th International Conference on Very Large Data Bases-Volume 29. VLDB Endowment, 225–236.
Mohammad Hammoud, Dania Abed Rabbou, Reza Nouri, Seyed-Mehdi-Reza Beheshti, and Sherif Sakr. 2015. DREAM:
Distributed RDF engine with adaptive query planner and minimal communication. PVLDB 8, 6 (2015), 654–665.
T. Härder, M. Haustein, C. Mathis, and M. Wagner. 2007. Node labeling schemes for dynamic XML documents reconsidered.
Data & Knowledge Engineering 60, 1 (2007), 126–149.
IBM. 2016. DB2 Database Software. Retrieved from http://www-01.ibm.com/software/data/db2/.
ISO/IEC 9075-14:2003. 2006. Part 14: XML-Related Specifications (SQL/XML). Int. Organization for Standardization.
Sayyed Kamyar Izadi, Mostafa S. Haghjoo, and Theo Härder. 2012. S3: Processing tree-pattern XML queries with all logical
operators. Data & Knowledge Engineering 72 (2012), 31–62.
H. V. Jagadish, Shurug Al-Khalifa, Adriane Chapman, Laks V. S. Lakshmanan, Andrew Nierman, Stelios Paparizos, Jignesh
M. Patel, Divesh Srivastava, Nuwee Wiwatwattana, Yuqing Wu, and Cong Yu. 2002. TIMBER: A native XML database.
VLDB Journal 11, 4 (2002), 274–291.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:38 R. Bača et al.
H. V. Jagadish, Laks V. S. Lakshmanan, Divesh Srivastava, and Keith Thompson. 2001. TAX: A tree algebra for XML. In
Proceedings of the 8th International Workshop on Database Programming Languages (DBPL). 149–164.
Haifeng Jiang, Hongjun Lu, and Wei Wang. 2004. Efficient processing of twig queries with OR-predicates. In Proceedings
of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD). 59–70.
Haifeng Jiang, Hongjun Lu, Wei Wang, and Beng Chin Ooi. 2003a. XR-tree: Indexing XML data for efficient structural joins.
In Proceedings of the 19th International Conference on Data Engineering (ICDE). 253–263.
Haifeng Jiang, Wei Wang, Hongjun Lu, and Jeffrey Xu Yu. 2003b. Holistic twig joins on indexed XML documents. In Pro-
ceedings of 29th International Conference on Very Large Data Bases (VLDB). 273–284.
Enhua Jiao, Tok Wang Ling, and Chee Yong Chan. 2005. PathStack-: A holistic path join algorithm for path query with
not-predicates on XML data. In Proceedings of the 10th International Conference on Database Systems for Advanced Ap-
plications (DASFAA). 113–124.
R. Kaushik, P. Bohannon, J. F. Naughton, and H. F. Korth. 2002a. Covering indexes for branching path queries. In Proceedings
of ACM SIGMOD 2002. ACM Press, 133–144.
R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes. 2002b. Exploiting local similarity for indexing paths in graph-structured
data. In Proceedings of the 18th International Conference on Data Engineering, 2001. 129–140. DOI:http://dx.doi.org/10.
1109/ICDE.2002.994703
Lukas Kircher, Michael Grossniklaus, Christian Grün, and Marc H. Scholl. 2015. Efficient structural bulk updates on the
pre/dist/size XML encoding. In 2015 IEEE 31st International Conference on Data Engineering. IEEE, 447–458.
M. Klettke and H. Meyer. 2000. XML and object-relational database systems – Enhancing structural mappings based on
statistics. In Lecture Notes in Computer Science, Vol. 1997. 151–170.
Michal Krátký, Jaroslav Pokorný, and Václav Snášel. 2004. Implementation of XPath axes in the multi-dimensional approach
to indexing XML data. In Current Trends in Database Technology, International Workshop on Database Technologies for
Handling XML information on the Web, DataX, EDBT 2004 (Lecture Notes in Computer Science), Vol. 3268. Springer–Verlag.
A. Kuckelberg and R. Krieger. 2003. Efficient structure oriented storage of XML documents using ORDBMS. In Proceedings of
the VLDB’02 Workshop EEXTT and CAiSE’02 Workshop DTWeb on Efficiency and Effectiveness of XML Tools and Techniques
and Data Integration over the Web – Revised Papers. Springer-Verlag, London, UK, 131–143.
Wolfgang Lehner and Kai-Uwe Sattler. 2013. Web-Scale Data Management for the Cloud. Springer. Retrieved from
http://www.springer.com/computer/database+management+%26+information+retrieval/book/978-1-4614-6855-4.
Viktor Leis, Alfons Kemper, and Thomas Neumann. 2013. The adaptive radix tree: ARTful indexing for main-memory
databases. In 2013 IEEE 29th International Conference on Data Engineering (ICDE) . IEEE, 38–49.
Guoliang Li, Jianhua Feng, Jianyong Wang, and Lizhu Zhou. 2007. Effective keyword search for valuable lcas over xml
documents. In Proceedings of the 16th ACM Conference on Information and Knowledge Management. ACM, 31–40.
Hanyu Li, Mong-Li Lee, Wynne Hsu, and Gao Cong. 2006. An estimation system for XPath expressions. In Proceedings of
the 22nd International Conference on Data Engineering (ICDE). 54.
Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeffrey Scott Vitter, and Ronald Parr. 2002. XPathLearner: An on-line self-
tuning Markov histogram for XML path selectivity estimation. In Proceedings of International Conference on Very Large
Data Bases (VLDB). 442–453.
Rung-Ren Lin, Ya-Hui Chang, and Kun-Mao Chao. 2013. A compact and efficient labeling scheme for XML documents. In
International Conference on Database Systems for Advanced Applications. Springer, 269–283.
Jian Liu, Z. M. Ma, and Li Yan. 2013. Efficient labeling scheme for dynamic XML trees. Information Sciences 221 (2013),
338–354.
Jian Liu and DL Yan. 2016. Answering approximate queries over XML data. IEEE Transactions on Fuzzy Systems 24, 2 (2016),
288–305.
Jian Liu and X. X. Zhang. 2016. Dynamic labeling scheme for XML updates. Knowledge-Based Systems (2016).
Ziyang Liu and Yi Chen. 2011. Processing keyword search on XML: A survey. World Wide Web 14, 5–6 (2011), 671–707.
Jiaheng Lu, Ting Chen, and Tok Wang Ling. 2004. Efficient processing of XML twig patterns with parent child edges: A look-
ahead approach. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management
(CIKM). 533–542.
Jiaheng Lu and Irena Holubová. 2017. Multi-model data management: What’s new and what’s next?. In Proceedings of
the 20th International Conference on Extending Database Technology, EDBT 2017, Venice, Italy, March 21–24, 2017. Volker
Markl, Salvatore Orlando, Bernhard Mitschang, Periklis Andritsos, Kai-Uwe Sattler, and Sebastian Breß (Eds.). Open-
Proceedings.org, 602–605. DOI:http://dx.doi.org/10.5441/002/edbt.2017.80
J. Lu, T. W. Ling, C. Y. Chan, and T. Chen. 2005a. From region encoding to extended Dewey: On efficient processing of XML
twig pattern matching. In Proceedings of the 31st Conference on Very Large Data Bases, VLDB 2005. VLDB Endowment,
193–204.
Jiaheng Lu, Tok Wang Ling, Zhifeng Bao, and Chen Wang. 2010. Extended XML tree pattern matching: Theories and
algorithms. IEEE Transactions on Knowledge and Data Engineering (TKDE) 2010 23 (2010), 402–416. Issue 3.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:39
Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, and Wei Ni. 2005b. Efficient processing of ordered XML twig pattern.
In Proceedings of DEXA 2005 (Lecture Notes in Computer Science), Vol. 3588. Springer-Verlag, 300–309.
Cheng Luo, Zhewei Jiang, Wen-Chi Hou, Feng Yu, and Qiang Zhu. 2009. A sampling approach for XML query selectivity
estimation. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database
Technology. ACM, 335–344.
Imam Machdi, Toshiyuki Amagasa, and Hiroyuki Kitagawa. 2009. XML data partitioning strategies to improve parallelism
in parallel holistic twig joins. In Proceedings of the 3rd International Conference on Ubiquitous Information Management
and Communication. ACM, 471–480.
Imam Machdi, Toshiyuki Amagasa, and Hiroyuki Kitagawa. 2010. Parallel holistic twig joins on a multi-core system. Inter-
national Journal of Web Information Systems 6, 2 (2010), 149–177.
Sebastian Maneth, Nikolay Mihaylov, and Sherif Sakr. 2008. XML tree structure compression. In Proceedings of the 19th
International Workshop on Database and Expert Systems Applications (DEXA’08). 243–247. DOI:http://dx.doi.org/10.1109/
DEXA.2008.41
Ioana Manolescu, Yannis Papakonstantinou, and Vasilis Vassalos. 2009. XML tuple algebra. In Encyclopedia of Database
Systems. Springer, 3640–3646.
Christian Mathis. 2007. Extending a tuple-based XPath algebra to enhance evaluation flexibility. Informatik-Forschung und
Entwicklung 21, 3–4 (2007), 147–164.
Norman May, Sven Helmer, and Guido Moerkotte. 2004. Nested queries and quantifiers in an ordered context. In 2013 IEEE
29th International Conference on Data Engineering (ICDE). IEEE Computer Society, 239–239.
Philippe Michiels, George A. Mihaila, and Jérôme Siméon. 2007. Put a tree pattern in your algebra. In Proceedings of the
23rd International Conference on Data Engineering (ICDE). 246–255.
Microsoft. 2016. Microsoft SQL Server 2016. Retrieved from http://www.microsoft.com/en-us/server-cloud/products/
sql-server/.
Salahadin Mohammed, Ahmad F. Barradah, and El-Sayed M. El-Alfy. 2016. Selectivity estimation of extended XML query
tree patterns based on prime number labeling and synopsis modeling. Simulation Modelling Practice and Theory 64
(2016), 30–42.
Salahadin Mohammed, El-Sayed M. El-Alfy, and Ahmad F. Barradah. 2015. Improved selectivity estimator for XML queries
based on structural synopsis. World Wide Web 18, 4 (2015), 1123–1144.
MonetDB BV. 2008. MonetDB/XQuery. MonetDB B.V.http://www.monetdb.org/XQuery.
Matthias Nicola, Irina Kogan, and Berni Schiefer. 2007. An XML transaction processing benchmark. In Proceedings of the
2007 ACM SIGMOD International Conference on Management of Data. ACM, 937–948.
Bo Ning, Guoren Wang, and Jeffrey Xu Yu. 2008. A holistic algorithm for efficiently evaluating Xtwig joins. In Proceedings
of the 13th International Conference on Database Systems for Advanced Applications (DASFAA 2008) (Lecture Notes in
Computer Science), Vol. 4947. Springer–Verlag, 571–579.
P. O’Neil, E. O’Neil, S. Pal, I. Cseri, G. Schaller, and N. Westbury. 2004. ORDPATHs: Insert-friendly XML node labels. In
Proceedings of the 2004 ACM International Conference on Management of Data, SIGMOD 2004. 903–908.
Oracle. 2016. Oracle Database 12c. (2016). https://www.oracle.com/database/.
Fatma Özcan, Normen Seemann, and Ling Wang. 2008. XQuery rewrite optimization in IBM DB2 pureXML. IEEE Data
Engineering Bulletin 31, 4 (2008), 25–32.
Yinfei Pan, Ying Zhang, and Kenneth Chiu. 2008. Parsing XML using parallel traversal of streaming trees. In International
Conference on High-Performance Computing. Springer, 142–156.
Neoklis Polyzotis and Minos Garofalakis. 2006a. XSKETCH synopses for XML data graphs. ACM Transactions on Database
Systems 31 (September2006), 1014–1063. Issue 3.
Neoklis Polyzotis, Minos Garofalakis, and Yannis Ioannidis. 2004b. Approximate XML query answers. In Proceedings of the
2004 ACM SIGMOD International Conference on Management of Data. ACM, 263–274.
Neoklis Polyzotis, Minos Garofalakis, and Yannis Ioannidis. 2004a. Selectivity estimation for XML twigs. In Proceedings of
the 20th International Conference on Data Engineering, 2004. IEEE, 264–275.
Neoklis Polyzotis and Minos N. Garofalakis. 2002. Structure and value synopses for XML data graphs. In Proceedings of the
International Conference on Very Large Data Bases (VLDB). IEEE, 466–477.
Neoklis Polyzotis and Minos N. Garofalakis. 2006b. XCluster synopses for structured XML content. In Proceedings of the
International Conference on Data Engineering (ICDE). IEEE, 63.
Viswanath Poosala and Yannis E. Ioannidis. 1997. Selectivity estimation without the attribute value independence sssump-
tion. VLDB Journal Vol. 97. 486–495.
Lu Qin, Jeffrey Xu Yu, and Bolin Ding. 2007. TwigList: Make twig pattern matching fast. In Proceedings of the 10th Interna-
tional Conference on Database Systems for Advanced Applications (DASFAA). 850–862.
Christopher Re, Jérôme Siméon, and Mary F. Fernández. 2006. A complete and efficient algebraic compiler for XQuery. In
Proceedings of the 22nd International Conference on Data Engineering (ICDE). 14.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:40 R. Bača et al.
Jonathan Robie, Matthias Brantner, Daniela Florescu, Ghislain Fourny, and Till Westmann. 2012. JSONiq: XQuery for JSON.
JSON for XQuery (2012), 63–72.
Kanda Runapongsa and Jignesh M. Patel. 2002. Storing and querying XML data in object-relational DBMSs. In Proceedings of
the Worshops XMLDM, MDDE, and YRWS on XML-Based Data Management and Multimedia Engineering-Revised Papers
(EDBT’02). Springer-Verlag, London, 266–285.
Sherif Sakr. 2007. Cardinality-aware and purely relational implementation of an XQuery processor. Ph.D. Dissertation. Uni-
versity of Konstanz. http://www.ub.uni-konstanz.de/kops/volltexte/2007/3259/.
Sherif Sakr. 2008. Algebra-based XQuery cardinality estimation. International Journal of Web Information Systems 4, 1 (2008),
7–46.
Sherif Sakr. 2009. XML compression techniques: A survey and comparison. Journal of Computer System Science 75, 5 (2009),
303–322.
Sherif Sakr. 2016. Big Data 2.0 Processing Systems - A Survey. Springer. DOI:http://dx.doi.org/10.1007/978-3-319-38776-5
Sherif Sakr and Ghazi Al-Naymat. 2010. Relational processing of RDF queries: A survey. ACM SIGMOD Record 38, 4 (2010),
23–28.
Sherif Sakr and Mohamed Medhat Gaber (Eds.). 2014. Large Scale and Big Data - Processing and Management. Auerbach
Publications. DOI:http://dx.doi.org/10.1201/b17112
Sherif Sakr, Anna Liu, Daniel M. Batista, and Mohammad Alomari. 2011. A survey of large scale data management ap-
proaches in cloud environments. IEEE Communications Surveys and Tutorials 13, 3 (2011), 311–336. DOI:http://dx.doi.
org/10.1109/SURV.2011.032211.00087
Sherif Sakr and Eric Pardede (Eds.). 2011. Graph Data Management: Techniques and Applications. IGI Global. DOI:
http://dx.doi.org/10.4018/978-1-61350-053-8
Airi Salminen and Frank Wm. Tompa. 1994. PAT expressions: An algebra for text search. Acta Linguistica Hungarica 41, 1
(1994), 277–306.
Stefan Schuh, Xiao Chen, and Jens Dittrich. 2016. An experimental comparison of thirteen relational equi-joins in main
memory. In Proceedings of the 2016 International Conference on Management of Data. ACM, 1961–1976.
Thomas Schwentick. 2007. Automata for XML: A survey. Journal of Computer System Science 73, 3 (2007), 289–315.
J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. J. DeWitt, and J. F. Naughton. 1999. Relational databases for querying
XML documents: Limitations and opportunities. In VLDB’99: Proceedings of the 25th International Conference on Very
Large Data Bases. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 302–314.
Lila Shnaiderman and Oded Shmueli. 2015. Multi-core processing of XML twig patterns. IEEE Transactions on Knowledge
and Data Engineering 27, 4 (2015), 1057–1070.
Adam Silberstein, Hao He, Ke Yi, and Jun Yang. 2005. BOXes: Efficient maintenance of order-based labeling for dynamic
XML data. In Proceedings of the 21st International Conference on Data Engineering (ICDE). IEEE, 285–296.
Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. 2002. Storing
and querying ordered XML using a relational database system. In Proceedings of the ACM International Conference on
Management of Data (SIGMOD’02). ACM Press, New York.204–215.
Jens Teubner, Torsten Grust, Sebastian Maneth, and Sherif Sakr. 2008. Dependable cardinality forecasts for XQuery. Pro-
ceedings of the VLDB Endowment (PVLDB) 1, 1 (2008), 463–477.
H. S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. October 2004. XML Schema Part 1: Structures (Second Edition).
W3C. http://www.w3.org/TR/xmlschema-1/.
Pingfang Tian, Dan Luo, Yaoyao Li, and Jinguang Gu. 2013. XML multi-core query optimization based on task preemption
and data partition. In Joint International Semantic Technology Conference. Springer, 294–305.
Silke Trißl and Ulf Leser. 2007. Fast and practical indexing and querying of very large graphs. In Proceedings of the 2007
ACM SIGMOD International Conference on Management of Data. ACM, 845–856.
University of Washington Database Group. 2002. The XML Repository. Retrieved from http://www.cs.washington.edu/
research/xmldatasets/.
Henrique Valer, Caetano Sauer, and Theo Härder. 2013. XQuery processing over NoSQL stores. In Grundlagen von Daten-
banken. Citeseer, 75–80.
Andreas M. Weiner and Theo Härder. 2009. Using structural joins and holistic twig joins for native XML query optimization.
In Advances in Databases and Information Systems (LNCS), Vol. 5739. Springer, Berlin, 149–163.
Andreas M. Weiner and Theo Härder. 2010. An integrative approach to query optimization in native XML database man-
agement systems. In Proceedings of the 14th International Database Engineering & Applications Symposium (IDEAS’10).
ACM, New York, 64–74.
Y. Wu, J. M. Patel, and H. Jagadish. 2003. Structural join order selection for XML query optimization. In Proceedings of ICDE
2003. IEEE CS, 443–454.
Yuqing Wu, Jignesh M. Patel, and H. V. Jagadish. 2002. Estimating answer sizes for XML queries. In Proceedings of the 8th
International Conference on Extending Database Technology (EDBT). 590–608.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:41
Yanghua Xiao, Ji Hong, Wanyun Cui, Zhenying He, Wei Wang, and Guodong Feng. 2012. Branch code: A labeling scheme
for efficient query answering on trees. In 2012 IEEE 28th International Conference on Data Engineering. IEEE, 654–665.
W. Xiao-ling, L. Jin-feng, and D. Yi-sheng. 2003. An adaptable and adjustable mapping from XML data to tables in RDB. In
Proceedings of the VLDB’02 Workshop EEXTT and CAiSE’02 Workshop DTWeb. Springer-Verlag, London. 117–130.
Liang Xu, Tok Wang Ling, Huayu Wu, and Zhifeng Bao. 2009. DDE: From Dewey to a fully dynamic XML labeling scheme.
In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data. ACM, 719–730.
B. Yang, M. Fontoura, E. Shekita, S. Rajagopalan, and K. Beyer. 2004. Virtual cursors for XML joins. In Proceedings of the
13th ACM International Conference on Information and Knowledge Management. ACM, 523–532.
Masatoshi Yoshikawa, T. Amagasa, T. Shimura, and S. Uemura. 2001. XRel: A path-based approach to storage and retrieval
of XML documents using relational databases. ACM Transactions on Internet Technology 1, 1 (2001), 110–141.
Tian Yu, Tok Wang Ling, and Jiaheng Lu. 2006. TwigStackList-: A holistic twig join algorithm for twig query with not-
predicates on XML data. In Proceedings of the 11th International Conference on Database Systems for Advanced Applica-
tions (DASFAA). 249–263.
Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong Luo, and Guy Lohman. 2001. On supporting containment queries in
relational database management systems. In Proceedings of the 2001 ACM SIGMOD International Conference on Manage-
ment of Data. ACM Press, 425–436.
Ning Zhang, Peter J. Haas, Vanja Josifovski, Guy M. Lohman, and Chun Zhang. 2005. Statistical learning techniques for
costing XML queries. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB). 289–300.
Ning Zhang, M. Tamer Özsu, Ashraf Aboulnaga, and Ihab F. Ilyas. 2006. XSEED: Accurate and fast cardinality estimation
for XPath queries. In Proceedings of the 22nd International Conference on Data Engineering (ICDE). IEEE, 61.
Junfeng Zhou, Wei Wang, Ziyang Chen, Jeffrey Xu Yu, Xian Tang, Yifei Lu, and Yukun Li. 2016. Top-down XML keyword
query processing. IEEE Transactions on Knowledge and Data Engineering 28, 5 (2016), 1340–1353.
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.