Structural XML Query Processing

Download as pdf or txt
Download as pdf or txt
You are on page 1of 41

Structural XML Query Processing

RADIM BAČA and MICHAL KRÁTKÝ, Department of Computer Science, VŠB – Technical University
of Ostrava, Czech Republic
IRENA HOLUBOVÁ, MARTIN NEČASKÝ, TOMÁŠ SKOPAL, and MARTIN SVOBODA,
Faculty of Mathematics and Physics, Charles University, Czech Republic
SHERIF SAKR, King Saud bin Abdulaziz University for Health Sciences, Saudi Arabia University of New
South Wales, Sydney, Australia

Since the boom in new proposals on techniques for efficient querying of XML data is now over and the
research world has shifted its attention toward new types of data formats, we believe that it is crucial to
review what has been done in the area to help users choose an appropriate strategy and scientists exploit 64
the contributions in new areas of data processing. The aim of this work is to provide a comprehensive study
of the state-of-the-art of approaches for the structural querying of XML data. In particular, we start with
a description of labeling schemas to capture the structure of the data and the respective storage strategies.
Then we deal with the key part of every XML query processing: a twig query join, XML query algebras,
optimizations of query plans, and selectivity estimation of XML queries. To the best of our knowledge, this is
the first work that provides such a detailed description of XML query processing techniques that are related
to structural aspects and that contains information about their theoretical and practical features as well as
about their mutual compatibility and general usability.
CCS Concepts: • Information systems → Database query processing; XML query languages; Data
access methods;
Additional Key Words and Pharses: XML, structural XML query processing
ACM Reference format:
Radim Bača, Michal Krátký, Irena Holubová, Martin Nečaský, Tomáš Skopal, Martin Svoboda, and Sherif Sakr.
2017. Structural XML Query Processing. ACM Comput. Surv. 50, 5, Article 64 (September 2017), 41 pages.
https://doi.org/10.1145/3095798

1 INTRODUCTION
The adoption of the eXtensible Markup Language (XML) (Bray et al. 2008) proposed by the W3C1
as a standard for information exchange has gained great momentum and currently is undoubtedly

1 http://www.w3.org.

This work was partially supported by the Czech Science Foundation project Nr. GAČR 15-08916S, it was conducted within
the ENET project CZ.1.05/2.1.00/19.0389 and Student Grant Competition project with reg. no. SGS SP2017/157.
Authors’ address: R. Bača and M. Krátký, Department of Computer Science, Faculty of Electrical Engineering and Com-
puter Science, VŠB – Technical University of Ostrava, 17. listopadu 15, 708 33 Ostrava – Poruba, Czech Republic; emails:
{radim.baca, michal.kratky}@vsb.cz; I. Holubová (Mlýnková), M. Nečaský, T. Skopal, and M. Svoboda, Department of Soft-
ware Engineering, Faculty of Mathematics and Physics, Charles University, Malostranské nám. 25, 118 00 Praha 1, Czech
Republic; emails: {holubova, necasky, skopal, svoboda}@ksi.mff.cuni.cz; S. Sakr, Software Systems Research Group, Na-
tional ICT Australia (NICTA), ATP lab, Sydney, Australia; email: [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2017 ACM 0360-0300/2017/09-ART64 $15.00
https://doi.org/10.1145/3095798

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:2 R. Bača et al.

the main standard for the representation and exchange of data. Before this happened, we witnessed
a massive boom in techniques that enable efficient storing and querying of XML data. Later, there
appeared two competing strategies: (1) XML-enabled database systems (DBS), typically represented
by the introduction of XML extensions to relational database management systems (RDBMS). For
example, XML is currently used as a data type in most well-known RDBMSs (e.g., Oracle,2 IBM
DB2,3 Microsoft SQL Server4 ). (2) Native XML database systems (or XDBMSs), represented by DBSs
(e.g., MarkLogic,5 Sedna6 ) that have been mainly designed and built for managing XML data and
supporting XML-related technologies (e.g., XPath, XSLT, XQuery). While the opponents of the
former claim that storing hierarchical and semi-structured XML data in flat relations causes inef-
ficiencies in XML query processing, the opponents of the latter emphasize that building a native
XDBMS from “scratch”—including all the “side effects” such as multiuser access, transactions, log-
ging etc.is nontrivial and cannot beat the theoretical and implementational maturity of RDBMSs.
History has shown that the truth is somewhere in between. While efficient XML data querying
requires specific data structures and algorithms, universal RDBMSs (and their robust implementa-
tions) bring the advantage of reasonably efficient processing of any kind of data, including XML.
In practice, XML DBSs have been actively used in several scenarios and application domains.7
For example, modern content management systems (e.g., Drupal,8 Joomla!,9 WordPress10 ) are heavy
users of XML storing, querying, and publishing techniques at their backends. However, recently,
we have been witnessing that the boom in the proposals of new techniques for efficient querying
of XML data is now over and the research world has shifted its attention toward other kinds of
data models and data formats (e.g., JSON (Bray 2014), NoSQL (Sakr et al. 2011), RDF (Beckett and
McBride 2004), or linked data (Bizer et al. 2009)). Therefore, we believe that it is time to review
what has been done in the area. Such an overview, classification, and comparison of approaches
can help users choose an appropriate strategy for a particular application and help researchers
learn more about what has been done and can possibly be exploited in new areas of data process-
ing that are closely related to the XML data model. For example, JSON is commonly described
as a fat-free alternative to XML; it follows the tree structure of the XML data model. However, it
is simpler and more closely matches the data models of most programming languages. In prac-
tice, current NoSQL document DBSs (e.g., MongoDB,11 CouchDB,12 or Azure DocumentDB13 ) use
JSON documents as their main storage unit. XML is also one of the main serialization formats
for the RDF data model and linked data. Therefore, various JSON and RDF storing and querying
techniques (Sakr and Al-Naymat 2010; Faye et al. 2012) and other tree-structured data (e.g., file
systems, parse trees for linguistic data and the secondary structure of RNA) can effectively uti-
lize the query processing techniques of XML data. In addition, various research efforts have been
proposed to bring the XML and RDF worlds closer (Droop et al. 2008), for instance, by querying
XML data using the SPARQL language (Bikakis et al. 2009), the standard query language for the

2 https://docs.oracle.com/cd/E11882_01/appdev.112/e23094/xdb01int.htm.
3 http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.xml.doc/doc/c0022308.html.
4 https://msdn.microsoft.com/en-us/library/hh403385.aspx.
5 http://www.marklogic.com/.
6 http://www.sedna.org/.
7 https://www.ibm.com/developerworks/community/wikis/home?lang=en_us#!/wiki/pureXML/page/Success%20Stories.
8 https://www.drupal.org/.
9 https://www.joomla.org/.
10 https://wordpress.com/.
11 https://www.mongodb.com/.
12 http://couchdb.apache.org/.
13 https://azure.microsoft.com/en-us/services/documentdb/.

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:3

RDF model,14 or by translating XPath queries into SPARQL queries (Droop et al. 2007). Other pro-
posals suggested using the XQuery language for querying JSON data (Robie et al. 2012; Bamford
et al. 2009) and NoSQL stores (Valer et al. 2013). According to Gartner (Feinberg and Ronthal 2016),
there is an increasing trend of a new generation of multimodel database systems (e.g., OrientDB,15
MarkLogic,16 or HPE Vertica17 ) that are designed to support storing data in a combination of related
models and query across them.
In this article, we provide a comprehensive survey of the state of the art of approaches and
related aspects for efficient structural XML data querying. In particular, we start with a description
of labeling schemes to capture the structure of the data and the respective storage strategies. Then
we deal with the key part of every query processing algorithm—a twig query join—as well as
optimizations of XML query processing. In particular, we summarize the main contributions of
this article as follows:
(1) We present a detailed description of up-to-date storage and indices for XML data, as well
as a classification of the physical access methods with regard to node labeling, document
partitioning, and twig query joins used.
(2) We provide a thorough description of the state-of-the-art twig query joins and their com-
parison in terms of their compatibility, features, and supported query models.
(3) We describe XML query algebras and outline their compatibility with twig query joins.
(4) We also discuss the main aspects of cost-based optimization techniques and selectivity
estimation approaches for XML queries. We compare these techniques in terms of sup-
ported query models, twig query join algorithms, and several other practical features as
well.
The rest of this article is structured as follows: In Section 2, we briefly describe the model of
XML data, current XML query languages, as well as labeling schemes that are commonly used in
indexing methods. Section 3 describes major storage and indices for XML data since they funda-
mentally influence the efficiency of query processing. In Section 4, we study the twig query join
problem, and, in Section 5, we deal with XML query algebras. In Section 6, we focus on optimiza-
tions and selectivity estimations of XML query processing. Finally, the conclusion of the article
and future work can be found in Section 7.

2 XML MODEL AND XML QUERY LANGUAGES


Before we focus on XML query processing strategies, we recall basic terms and technologies
used in the remaining parts of this article. Technical details can be found in the referenced W3C
specifications.

2.1 XML and XML Model


For the purpose of machine processing of XML data, we do not view an XML document as a
text document, but we instead use it as a model. Every XML document has to be well-formed,
which basically means that its tags form a hierarchical structure, and, hence, the model is a tree
with several types of data nodes corresponding to elements, attributes, or textual data. In the
following text, we use the term “node” instead of the more correct form of “data node” for the
sake of brevity. An example of an XML document and its XML tree model is depicted in Figure 1.

14 https://www.w3.org/TR/rdf-sparql-query/.
15 http://orientdb.com/orientdb/.
16 http://www.marklogic.com/.
17 http://www.vertica.com/.

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:4 R. Bača et al.

Fig. 1. An XML document and its XML tree model.

Note that we ignore references (i.e., IDREF or IDREFS attributes) when considering the XML tree
model (Polyzotis et al. 2004a). In the following text, we use a term labeled path for a sequence of
node names of a path in an XML tree.

2.2 XML Querying


Issues and difficulties of XML querying are mostly observed with respect to the XPath (Clark and
DeRose 1999) and XQuery (Boag et al. 2010) languages, where the latter is actually an extension
to the former. Both these languages are key standards among XML query languages, and so the
corresponding XML query processing algorithms are essential for any XDBMS.
XPath is a query language for selecting nodes from an XML document. The XPath language
provides the ability to navigate within an XML document and select its particular nodes by a
variety of criteria. Every XPath query (at least in its simplest form) consists of a sequence of location
steps in the form of axis::name[predicate], where the result of each location step for a context
node u is a set of nodes where each node u  of this set satisfies the following conditions:
— The relation axis includes (u,u’).
— The name of u’ is name.
— The condition predicate is satisfied for u’.
We can find two major types of constructs in any XPath query: (1) structural constructs that
include navigation in an XML tree, an element or an attribute name selection, a wildcard, Boolean
expressions, and quantifiers; and (2) content constructs that include predicates on the node content
(i.e., the element content or the attribute value) and a comparison of the node content. The con-
tent constructs can be often handled by methods designed for query processing in RDBMS (Bayer
and McCreight 2002; Poosala and Ioannidis 1997; Chaudhuri 1998); a number of works deal with
the content constructs in XML (Liu and Chen 2011; Li et al. 2007; Bao et al. 2009; Agarwal et al.
2016). Here, we are more focused on the structural constructs in this survey. The major structural
query model used by most approaches is called a twig pattern query (TPQ). A basic TPQ Q = (V , E)
is a tree with a set V of query nodes and a set E of edges. Query nodes represent nodes of an
XML document to be retrieved, and edges represent structural relationships between the nodes.
A query node q ∈ V is labeled by a node name. An edge e ∈ E can be of the parent-child (PC) or
the ancestor-descendant (AD) type. PC and AD edges are visualized by single and double lines,
respectively.
TPQs can represent XPath queries that contain child and descendant axes in their steps and
predicates. A sample XPath expression of this kind is depicted in Figure 2 (Q1). If an XPath query is
without branches, we describe it with a simple path expression. The XPath language also contains

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:5

Fig. 2. XPath queries and their corresponding TPQs.

Fig. 3. XQuery queries and their corresponding GTP representations.

other constructs. Thus, there exist different extensions of TPQ for handling various structural
aspects of XPath queries (Ning et al. 2008; Lu et al. 2005b). An example of the major TPQ extensions
is provided in Figure 2 (Q2–Q4). We use a notation for TPQ extension names, where a structural
construct not belonging to a basic TPQ definition is usually written as a superscript of a TPQ name
(i.e., Q2 and Q4).
Algorithms evaluating a TPQ on an XML document are called twig query joins or structural joins.
Any twig query join finds all the occurrences of a TPQ in an XML tree (also called a TPQ matching).
We discuss these in Section 4. Note that TPQ matching is often considered a core problem of XML
querying.
The XQuery language (Boag et al. 2010) extends XPath in many ways. Two important extensions
are the capability to specify output query nodes and an introduction of sequences in values. If we
use only the TPQ model, we would have to perform a TPQ output postprocessing in order to get
an expected XQuery output. To overcome this problem, a generalized query pattern (GTP) (Chen
et al. 2003) is introduced making such postprocessing unnecessary. Figure 3 shows an example of
three XQuery queries with their corresponding GTP models. We use a simplified notation of the
GTP model introduced in Chen et al. (2006), which exploits a circle to mark an output query node,
a square to mark an optional output query node, and a dashed edge to represent an optional rela-
tionship. The optional edge and the optional output node enable a simple representation of the let
clause or an optional XPath construct in the return clause (see Q6 and Q7 for an example). More-
over, values in query matches corresponding to the optional output nodes are (possibly empty)
sequences which reflects the XQuery data model. The simplified GTP model does not consider the
content constructs; it focuses only on the structural constructs of XQuery.

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:6 R. Bača et al.

Fig. 4. (a) Containment labeling scheme. (b) Dewey order labeling scheme.

2.3 Labeling Schemes


Most of the twig query joins use a labeling scheme that assigns a unique node label to each node.
Node labels allow us to resolve the following basic operations between two nodes u and v during
query processing:
— LCA: Finds a lowest common ancestor of u and v in an XML tree; it extracts labels of all
ancestors from the node label.
— AD/PC: Resolves AD or PC relationships.
— F/P: Resolves following or preceding XPath relationships.
— Sibl: Resolves following-sibling or preceding-sibling XPath relationships.
— DO: Decides whether u has a lower document order than v or not.
There are two major types of labeling schemes: (1) fixed-length labeling schemes, where the
label has a fixed length, such as the Containment labeling scheme (Zhang et al. 2001) (see an
example in Figure 4(a)) or the Dietz’s labeling scheme (Dietz 1982; Grust 2002), and (2) prefix-
based labeling schemes, where the length of the label is equal to the node depth, such as the Dewey
order (Tatarinov et al. 2002) (see an example in Figure 4(b)). The node label serves as an unique
identifier (node id) of a node in an XML document in many labeling schemes. However, some
labeling schemes (Zhang et al. 2001; Dietz 1982; Lin et al. 2013) have the node id shorter than the
node label itself. For example, the whole label of Dewey order is the node id; on the other hand,
only the first number of the label in the case of Containment is the node id (see Figure 4).
The labeling schemes mentioned in the previous paragraph (Zhang et al. 2001; Dietz 1982;
Tatarinov et al. 2002) are early works and have certain issues, such as poor update performance
or lack of LCA operation support. These issues were the main reason for the introduction of sev-
eral other labeling schemes such as ORDPath (O’Neil et al. 2004), DDE (Xu et al. 2009), or Branch
code (Xiao et al. 2012). A specific way to improve the update capability of the fixed-length labeling
schemes is the introduction of bulk operations (Kircher et al. 2015). CT-label (Lin et al. 2013) is
an approach that aims at reducing the label size while an XML document becomes nearly static.
There are also labeling schemes ignored here, such as compressed branch code (Xiao et al. 2012),
DFPD (Liu et al. 2013), or DPLS (Liu and Zhang 2016), because they can introduce false hits under
certain circumstances.
Table 1 compares the labeling schemes based on the following criteria:
— Basic operation complexity: Depicts the complexity of the basic operations, where d is the
average depth of an XML tree.
— Extra cost: Describes whether there is an additional cost hidden behind the basic operation
complexity (e.g., usage of a modulo or a division).

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:7

Table 1. A Comparison of Labeling Schemes

Labeling Basic operations complexity


Type Extra cost Insert Stor.
scheme LCA AD/PC F/P Sibl DO
Dewey order O(d) O(d) O(d) O(d) O(d) None Low O(dN)
ORDPath O(d) O(d) O(d) O(d) O(d) Detection of odd None O(dN)
numbers;
Prefix-
lengths of labels
based
increase after
many insertions
DDE O(d) O(d) O(d) O(d) O(d) Division of label None O(dN)
numbers
Dietz’s - O(1) O(1) O(1) O(1) None Med O(N)
Fixed- Containment - O(1) O(1) O(1) O(1) None Med O(N)
length Branch code O(d) O(1) O(d) O(1) O(1) Modulo Low O(NlogN)
operation
CT-label - O(1) O(1) O(1) O(1) None High O(N)

— Insert: Depicts the difficulty of a node insertion operation. Insertion time is considered
“None” if there is no need to update labels after an insertion. The insertion time is consid-
ered “Low” if the following-sibling nodes with respect to the new node and their subtrees
need to be relabeled in the worst case. The insertion time is considered “Med” if we need to
relabel the whole tree in the worst case. (There is a gap technique that can accommodate
insertions in most cases while still preserving storage complexity.) The insertion time is
considered “High” otherwise.
— Storage (Stor.): Depicts the space complexity of each labeling scheme mainly with respect
to the number of nodes N .
Clearly, the prefix-based labeling schemes have fast insertions, but this is balanced out by the
cost of basic operations. Therefore, there is no holy grail of labeling schemes, and a tradeoff needs
to be made between the performance of basic operations or the performance of updates. When we
select an appropriate labeling scheme, we have to also consider LCA operation support because it
is required by various twig query joins (see Section 4.2.3).

3 XML STORAGE TECHNIQUES


Query processing strategies are strongly related to how we access data. We start by describing the
basic concepts that are specific for native XDBMSs and XML-enabled DBS in Sections 3.1 and 3.2,
respectively.

3.1 Native XDBMSs


Native XDBMSs build structural indices that allow them to avoid the necessity of accessing an XML
document when resolving structural constructs of XML queries. Section 3.1.1 describes a general
storage concept called the partitioning of an XML document. Sections 3.1.2 and 3.1.3 describe two
data structures called a schema tree and a node index (both structures can be built during prepro-
cessing of an XML document), while Section 3.1.4 summarizes the advantages of various storage
settings.
3.1.1 Partitioning of XML Document. Nodes of an XML document can be easily divided into
disjoint sets (partitions), where each set is identified by its partition label. Partitioning approaches

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:8 R. Bača et al.

Fig. 5. Properties of various XML document partitionings.

can be divided as follows: (1) Those based on a document structure (Zhang et al. 2001; Goldman
and Widom 1997; Chen et al. 2007; Kaushik et al. 2002a), and (2) those considering a typical XML
query workload (Kaushik et al. 2002b; Chen et al. 2003). Tag partitioning (Zhang et al. 2001; Al-
Khalifa et al. 2002) is the most common partitioning based on a document structure. Nodes are
divided according to their tags, and the number of partitions is equal to the number of unique
tags in an XML document. A tag+level (TL) (Chen et al. 2005) is a decomposition of the tag par-
titioning. Labeled path partitioning (Yoshikawa et al. 2001; Chen et al. 2005; Krátký et al. 2004) is
another common partitioning approach, sometimes also called prefix path streaming (PPS) (Chen
et al. 2005). Two nodes are equivalent in the labeled path partitioning if they have the same root-
to-node labeled path. The forward & backward partitioning (F&B) introduced by Kaushik et al.
(2002a) has the highest number of partition labels. Two nodes are equivalent in F&B if they have
the same root-to-node labeled path and all node-to-leaf labeled paths. Each of these partitionings
is actually a decomposition (refinement) of another one, as illustrated in Figure 5. On one hand, tag
partitioning produces a low number of partitions with a high number of nodes in each partition.
On the other hand, F&B provides the opposite property: A possibly high number of partitions with
a low number of nodes per each partition. We can also use a more refined partitioning approach
to speed up query processing. For example, labeled path partitioning allows us to resolve a simple
path expression without any AD relationships by a single random access into a node index (see
Section 3.1.3). Properties of different partitioning methods are discussed in detail in Section 3.1.4.
Another type of partitioning is semantic partitioning (Alghamdi et al. 2014) that partitions the XML
document according to the structure specified in an XML schema. Such an approach is similar to
labeled path partitioning in many cases; however, using semantic partitioning, we give a user the
possibility to influence the way we partition the data.

3.1.2 Schema Tree. A schema tree for an XML document is a labeled tree in which each labeled
path occurs only once. An example of a schema tree is depicted in Figure 15(b). In the literature,
there are many names for this tree: DataGuide (Goldman and Widom 1997), path tree (Aboulnaga
et al. 2001), summary tree, summary index, path index, and so on. A schema tree is useful for the
following purposes: (1) to determine partition labels corresponding to a query (Bača and Krátký
2008) when a more refined partitioning is used (e.g., tag+level, or labeled paths), (2) to support a
simple selectivity estimation (Aboulnaga et al. 2001), and (3) to get general knowledge about an
XML document structure (e.g., to support an auto-completion feature in an XML editor).

3.1.3 Node Indices. Node labels are not stored in a schema tree structure; instead, they are
stored in a node index as values. There are two basic types of node indices: (1) those having a node
id as a key (document indices), and (2) those having a partition label as a key (partition indices).
Note that the document index has node id as a key and node label as a value, which can be almost
the same information in some labeling schemes. However, some other information about the
node (i.e., pointer to the node content) can be stored in the value of the document index. Nodes
corresponding to one partition label are sorted according to the node label in the partition index;
therefore, a sequential scan of the key’s list is usually necessary during query processing. This

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:9

Fig. 6. (a) Document index. (b) Partition index. (c) Partition index with indexed lists.

can be improved by an index (XB-tree (Bruno et al. 2002) or XR-tree (Jiang et al. 2003a)) built over
each list. All types of node indices are schematically depicted in Figure 6.
From the query processing perspective, a document index is very useful when we have a small
set of context nodes and we want to use it to process the remaining relationships of a query.
This type of query processing can be considered navigational and is thoroughly described in Sec-
tion 4.1.2. On the other hand, many twig query joins are based on the partition index (see Section 4).
This type of join is mainly focused on the merge that occurs during one sequential scan of lists that
removes irrelevant nodes (Bruno et al. 2002; Al-Khalifa et al. 2002). “List” in the partition index is
called stream in the join, and we use the term sequence in the following text.
A specific category of node indexes address fast XML updates (Finis et al. 2016, 2013; Silberstein
et al. 2005). These works use an index with a structure similar to a document index. They use
physical node position instead of a node label, and they provide much better update performance;
the specific type of update operation (such as a tree relocation) has logarithmic complexity instead
of linear, as in the case of all labeling schemes mentioned in Section 2.3. On the other hand, they
do not offer the query performance provided by the partition index.

3.1.4 Balancing XML Storage. From the query processing point of view, the selection of parti-
tioning influences the size of three problems: (1) the overall amount of nodes that have to be read
from a node index, (2) the number of random accesses into a node index, and (3) the necessity to
find all the partition labels in a potentially large schema tree. If we minimize the first problem by
using more refined partitioning, then the latter two problems of query processing increase and
vice versa. This behavior also depends on query workload. For example, the F&B partitioning al-
lows us to process a TPQ without AD edges by a single random access to a node index. On the
other hand, the number of partition labels can be very close to the overall number of nodes in an
XML tree, and it can take a long time to find them. In such a case, the problem of TPQ matching
is shifted to the problem of finding all the partition labels in a schema tree, which is an identical
problem and so the same algorithms can be used (Goldman and Widom 1997).
As mentioned in Section 3.1.1, there are also partitioning techniques that are based on a typical
XML query workload (Kaushik et al. 2002b; Chen et al. 2003). These approaches take into account
typical queries, and the partitioning is created with respect to them. They are parameterized by a

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:10 R. Bača et al.

constant k (A(k )-index (Kaushik et al. 2002b), D (k )-index (Chen et al. 2003)), where k represents the
length of a labeled path that is considered during the partitioning. The number of partition labels
increases with increasing k. These approaches minimize all three query processing problems of
partitioning mentioned earlier; however, they are only effective for a specific workload.

3.2 XML-to-Relational Mapping


XML-to-relational mapping denotes how the XML data are stored and mapped into relations. We
usually speak about the shredding of XML data into tables. Once we have such a mapping, the
XML queries over the XML data can be translated to SQL queries over the relational tables, and
then the obtained SQL queries can be evaluated in a standard way for the RDBMS. On the basis
of exploitation or omitting information from an XML schema, we can distinguish between the so-
called generic and schema-driven methods. From the input data point of view, we can talk about
so-called fixed methods that store data purely on the basis of their model, and adaptive methods,
where sample XML documents and XML queries are also taken into account to find a more efficient
storage strategy. Moreover, there are also techniques based on user involvement; these can be
divided into user-defined and user-driven, depending on the extent of user interaction.
3.2.1 Generic vs. Schema-Driven Methods. Generic mapping methods (Florescu and Kossmann
1999; Kuckelberg and Krieger 2003) do not use the XML schema of stored XML documents (even
if it exists). They are usually based on one of the following two approaches: (1) creating a general
(object-)relational schema into relations of which any XML document regardless of its structure
can be stored, or (2) creating a special kind of (object-)relational schema into relations where only
a certain collection of XML documents that have a similar structure can be stored.
A typical representative of generic mapping is a group of methods called structure-centered map-
ping (Kuckelberg and Krieger 2003). It considers all nodes of an XML tree T having the same
structure defined as a tuple v = (t, l, c, n), where t is the type of the node (e.g., element, attribute,
text, ...), l is the node label, c is the node content, and n ∈ {1, ..., n} is a list of successor nodes. The
problem is how to store the lists of successor nodes. In the Foreign Key Strategy, every tree node v
is mapped to a tuple with a unique identifier and a foreign key reference to the parent node. This
method is quite simple, and the stored tree can be easily modified; however, the retrieval of the
data involves many self-join operations. In the Depth First (DF) Strategy, each node of T is provided
with an index value which represents its position in T ; this is similar to the Containment labeling
scheme described in Section 2.3.
On the other hand, schema-driven mapping methods (Shanmugasundaram et al. 1999;
Runapongsa and Patel 2002) are based on an existing schema S 1 of stored XML documents, ex-
pressed in DTD (Bray et al. 2008) or XML Schema (Thompson et al. 2004; Biron and Malhotra
2004), which is mapped to an (object-)relational database schema S 2 . The data from XML docu-
ments that have been validated against S 1 are then stored into relations of S 2 . The best-known and
probably the first representative of schema-driven mapping methods is a group of three algorithms
for mapping a DTD to a relational schema: Basic, Shared, and Hybrid (Shanmugasundaram et al.
1999). The main idea, further used in all successive approaches, is based on a definition of a directed
graph, a so-called DTD graph, which represents the processed DTD. The algorithms try to grad-
ually improve the idea “to create one relation for each element composed of its attributes, and to
map element–subelement relationships using keys and foreign keys,” while they differ according
to the amount of redundancy they may cause.
3.2.2 Fixed vs. Adaptive Methods. Adaptive methods (Klettke and Meyer 2000; Bohannon et al.
2002; Xiao-ling et al. 2003) focus on the idea that each application, represented using sample data
and operations (i.e., queries), requires a different storage strategy to achieve optimal efficiency.

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:11

A representative of adaptive schema-driven mapping methods is an algorithm proposed in the


LegoDB system (Bohannon et al. 2002). First, the method defines a fixed mapping of XML schema
structures (in particular, simpler but semantically equivalent p-schemas) to relations. The flexibil-
ity is based on the idea of exploring a space of possible XML-to-relational mappings and selecting
the best one according to statistics on a sample set of XML documents D and queries Q. In order to
select the best mapping, the system in turns applies the following two steps to the source p-schema
until a good result is achieved:
(1) Any possible XML-to-XML transformation (e.g., inlining/outlining; i.e., storing an XML
node in the parental/separate table, wildcards rewriting, etc.) is applied to the p-schema.
(2) XML-to-relational mapping (similar to the previously described fixed ones) is applied to
the new p-schema; against the resulting relational schema, queries from Q over data from
D are estimated.
As the space of generated p-schemas is possibly infinite, the paper also proposes a greedy eval-
uation strategy that explores only the most interesting subset.
3.2.3 User-Defined vs. User-Driven Methods. Both user-defined and user-driven approaches are
based on the same idea of leaving the whole process in the hands of a user who knows the target
application best. User-defined (Amer-Yahia 2003) mapping methods require that the user first de-
fines the target relational schema and then expresses a required mapping between an XML schema
and a relational schema using a system-dependent mechanism (e.g., a special query language, a
declarative interface, etc.). The problem is that such an approach assumes that users are skilled
in two complex technologies. Hence, most existing systems support a kind of user-driven map-
ping (Balmin and Papakonstantinou 2005; Du et al. 2004) where the necessary effort of the user is
lower. The user can only influence a default fixed mapping strategy using annotations that specify
the required mapping for particular schema fragments. Probably the first approach that addresses
these issues was proposed in the ShreX system (Du et al. 2004). The mapping specifications are
made by annotating the input XML schema definition with a predefined set of annotations, such
as outline (indicating whether the XML schema fragment should be inlined/outlined), tablename
or columnname (indicating requested names of tables/columns), maptoclob (indicating mapping
of the XML schema fragment to the CLOB data type), and the like.
Considering the XML-to-relational mapping strategies, all three main commercially available
RDBMSs (Oracle DB (Oracle 2016), IBM DB2 (IBM 2016), Microsoft SQL Server (Microsoft 2016))
support a kind of shredding into relations, both automatic and user-driven. In the case of auto-
matic methods only fixed approaches are supported, whereas in the case of user-driven methods
a set of system-specific annotations, closely related to the supported fixed mapping approaches, is
provided. However, all the systems still require a highly skilled user and, hence, user-driven ap-
proaches have become obsolete. On the other hand, we can also witness a “rapprochement” with
native approaches. All three systems, in accordance with the SQL/XML standard (ISO/IEC 9075-
14:2003 2006), also support a kind of XML data type together with related index structures that are
ablet to store XML data in a native way.

4 TWIG QUERY JOINS


As discussed in Section 2.2, the basic problem of XML querying is finding all occurrences (query
matches) of a TPQ in an XML tree (Czerwinski et al. 2015). Algorithms addressing this problem
are usually called twig query joins, and they represent a basic operator in an XML query algebra
(see Section 5). Unlike in the relational domain, where the term “join” stands for an algorithm-
independent operation, in this survey we use the term “twig query join” for algorithms that solve
the TPQ matching problem. The existing twig query joins use a data structure API, outlined in
ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:12 R. Bača et al.

Table 2. Data Structure API

Navigational API Stream API


children(node, partition labels): sequence open(partition label): sequence
descendants(node, partition labels): sequence
parent(node, partition labels): sequence
ancestors(node, partition labels): sequence

Table 3. Sequence API

Sequence API
next(): node
finished(): bool

Table 2, and consist of two parts: (1) a navigational API and (2) a stream API. Each method in
these APIs returns a sequence of nodes which can be manipulated according to a sequence API
(see Table 3). Each method in the navigational API returns a sequence that corresponds to an
input partition label and satisfys an XPath axis with an input node. It should be noted that some
approaches also implement other XPath axes such as following or preceding in the navigational
API (Grust 2002). The stream API method returns a sequence corresponding to an input partition
label, and the sequence API contains methods to manipulate a sequence.
Operations in the APIs have significantly different performance characteristics depending on
available node indices (see Section 3.1.3). In general, the document index supports both the navi-
gational and stream APIs, whereas the partition index supports only the stream API. However, the
document index has to access nodes using a sequential scan of a B-tree containing many irrelevant
nodes when processing the stream API, and this is inefficient. A query processed in the partition
index accesses only nodes relevant to a partition label; therefore, it is crucial for the efficient pro-
cessing of all merge-like algorithms (see Sections 4.1.1 and 4.2). The selection of an appropriate
index and algorithm is a task for a cost-based optimizer (see Section 6.1).
Before we describe particular algorithms, we specify the general steps that can be identified as
parts of any twig query join as follows:

— Filtering: An algorithm scans sequences and filters out nodes not corresponding to any
query match. In this stage, algorithms use main memory filtering data structures to get rid
of the maximum number of nodes that are irrelevant to a query. The most common filtering
data structure is a stack or a set of stacks. The main feature of the filtering data structure
is that it can help to decide whether the node is useless or not in constant time. Of course,
every filtering can have false hits.
— Intermediate storage: Every join uses some type of intermediate storage where nodes are
stored before a query output is enumerated. In some approaches, a filtering data structure
is used as intermediate storage as well.
— Output enumeration: In this step, algorithms read the intermediate storage and generate
an output that is usually in a form of ordered tuples. The major task of this step is often
tuple ordering according to a query model.

One of the most significant differences among joins is the method used during the filtering step.
The first type of filtering methods focuses only on a pair of query nodes, and the twig query join
is processed as a set of binary structural joins. By contrast, holistic joins filter input nodes based on

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:13

Fig. 7. Categories of algorithms utilized for twig query joins.

Table 4. Major Features of Twig Query Join Categories

Pros Cons
Binary structural joins Easily integrated into any Efficiency is significantly
XML query algebra and dependent on the selection of
support all XPath axes a good query plan. They can
produce a large intermediate
result when compared to the
query output.
Top-down holistic joins Linear I/O complexity of A sequential scan of an
query processing with intermediate result is
respect to the sum of output required even if it contains
and input sizes for some many useless nodes (see
query types and unnecessary Section 4.2.1).
query plan optimizations
Bottom-up holistic Linear CPU and I/O They can produce a large
joins complexities of the output intermediate result compared
enumeration with respect to to the query output (see
the output size Section 4.2.2).

information from all query nodes. Figure 7 summarizes the major categories of twig query joins,
and Table 4 depicts their main features.

4.1 Binary Structural Joins


The first type of twig query join works only with a pair of query nodes connected with an XPath
relationship. In principle, there are two basic types of binary structural joins:
— Merge-like algorithms, which are similar to the relational merge join. Algorithms of this
category read both input sequences and perform a merge.
— Navigational algorithms, which are similar to the relational nested-loop join with an
index. A navigational algorithm reads a sequence, and, for each node from this sequence, it
performs an appropriate navigational API operation. This algorithm is called “navigational”
since the operations are similar to DOM navigation in an XML document.
4.1.1 Merge-Like Algorithms. An earlier work (Zhang et al. 2001) showed that an algorithm
called MPMGJN can significantly outperform classical relational join algorithms. MPMGJN is
based on the relational merge join algorithm using sequences as an input. Al-Khalifa et al. (2002)
introduced two algorithms called Tree-Merge and Stack-Tree. We first describe the key features
of Tree-Merge, which is a generalized version of MPMGJN. The time and space complexities of

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:14 R. Bača et al.

Tree-Merge differ depending on the axis being resolved by the algorithm. In the case of the AD
relationship, the worst-case time and space complexities are linear with respect to the sum of the
input sequence and the output data sizes. In the case of the PC relationship, the time complexity
degrades. This is because Tree-Merge must work in the same way as for AD relationships, with
a difference being that each output pair is checked to determine whether it satisfies the PC rela-
tionship. Therefore, we can possibly check many irrelevant pairs in the case of a PC relationship.
The problem with PC edges is common to many twig query joins, as we will show in Section 4.2.
Moreover, Tree-Merge also has a problem with I/O complexity when two or more nodes in the
sequence have AD relationships. In that case, Tree-Merge must read certain parts of a sequence
repeatedly. Generally, Tree-Merge is rather inefficient when an input XML document contains re-
cursive data, even for AD relationships, since in such a case the algorithm performs repeated scans
of the sequences.
Al-Khalifa et al. also proposed another algorithm, called Stack-Tree, to solve the I/O complexity
problem of Tree-Merge. Stack-Tree exploits a stack that prevents repeated comparisons; therefore,
Stack-Tree reads both input sequences only once. The concept of stacks has fundamentally influ-
enced most later approaches in the area, and holistic algorithms are built on similar principles as
well. Stack-Tree maintains a stack containing only nodes lying on the same root-to-leaf path. In
other words, each node of the stack is a descendant of each node below it. This prevents repeated
reads of input sequences and ensures a linear worst-case I/O complexity with respect to the sum
of the input and output sizes in the case of PC and AD relationships.

4.1.2 Navigational Algorithms. Binary structural join are another type of navigational algo-
rithm (Grust et al. 2003; Halverson et al. 2003; Zhang et al. 2005). This type of query processing per-
forms a navigational operation for each node from the input sequence. The major representative of
navigational algorithms is the StairCase join (Grust et al. 2003). Let us consider an input sequence
(context nodes) corresponding to a query node. This sequence is usually an intermediate result
created by previous query processing. The StairCase join is based on XPath accelerator (Grust
2002), which is a simple algorithm that iteratively searches for successors of every context node
while avoiding duplicate scans of XPath accelerator. Although the complexity boundaries of the
StairCase join have not been proved, this algorithm was shown to be fast on a real implementation,
MonetDB/XQuery (MonetDB BV 2008).

4.2 Holistic Twig Joins


Bruno, Koudas, and Srivastava (2002) presented an approach that overcomes the problem of the
possibly large intermediate results produced by binary structural joins by evaluating a TPQ holis-
tically. They propose two algorithms called PathStack and TwigStack to set up a new family of twig
query joins called holistic twig joins (or holistic joins). From the perspective of the categorization
applied in the case of binary structural joins, holistic joins would all be categorized as merge-like
algorithms.
We briefly introduce several terms that are used to classify and explain the differences among
holistic joins. First, we define two types of occurrences of a TPQ Q in an XML document D:

— Query match of Q in D is a tuple of nodes from D where: (1) each query node q ∈ Q has
exactly one corresponding node in the tuple, (2) each pair of nodes satisfies the XPath rela-
tionship between the corresponding query nodes.
— Weak query match of Q in D is defined in the same way as the query match of Q, but with
a difference being that a pair of nodes can only have an AD relationship even though the
corresponding query nodes have a PC relationship.

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:15

Every weak query match is a relaxed version of the query match. If algorithm filtering detects
a weak query match with a node, it does not mean that a query match also exists. However, it is
much easier and faster to look only for weak query matches during the filtering step.
Holistic algorithms apply filtering to decide whether a node belongs to a (weak) query match
or not. We define the following types of nodes:
— Useless node nq : It is confirmed that no query match exists of Q containing nq .
— Weakly matched node nq : The existence of a weak query match of q with nq is confirmed.
— Matched node nq : The existence of a query match of a subquery rooted by q containing nq
is confirmed.
— Fully matched node nq : The existence of a query match of Q with nq is confirmed.
Note that any matched node can still be useless. We define the last type of a query match com-
monly evaluated during the filtering step of a join. Let Qp (q) be a set of query nodes on the path
from the root query node to a query node q. A prefix query match of q in an XML document D is
simply a query match of Qp (q) .
We recognize two main types of filtering approaches: (1) top-down filtering (Bruno et al. 2002;
Lu et al. 2004, 2005a), which skips most of the irrelevant nodes before they are stored on stacks,
and (2) bottom-up filtering (Chen et al. 2006; Qin et al. 2007), which skips most of the irrelevant
nodes when they are popped out from their stacks. Top-down filtering has a significant feature:
For some types of queries, it stores only fully matched nodes in a filtering data structure and inter-
mediate storage. We consider that a holistic algorithm is optimal in such a case. The major features
of an optimal holistic algorithm are linear worst-case time and I/O complexities with respect to the
sum of the input and output data sizes. Holistic algorithms utilizing top-down filtering are thor-
oughly described in Section 4.2.1. Bottom-up filtering optimizes another important aspect of the
holistic algorithm: output enumeration. We describe holistic algorithms using bottom-up filtering
in Section 4.2.2.

4.2.1 Top-Down Approaches. As mentioned in Section 4.2, TwigStack and PathStack represent
the first holistic approaches. They use one stack for each query node as a filtering data structure,
similarly to Stack-Tree. Nodes stored in the same stack all have AD relationships. Before we push
a node on a stack, we pop out all nodes that are not its ancestors. This behavior leads to linear
space complexity of filtering with respect to the depth of an XML document, which is typical for
all top-down approaches.
In general, PathStack is a very simple algorithm performing a prefix match filtering that filters
out every node nq not having any corresponding prefix query match of q. In other words, Path-
Stack considers only a query path to q, and, therefore, it works very efficiently for simple path
expressions (i.e., queries without predicates). On the other hand, PathStack can perform poorly
on twig queries. This is a strong field for TwigStack, because TwigStack considers the whole TPQ
during filtering. Before TwigStack checks the prefix match of q, it performs a weak match filtering
that filters out all nodes that are not weakly matched. The weak match filtering is represented by
a filtering function called getNext. Bruno et al. prove that TwigStack is optimal if TPQ contains
only AD relationships. In other words, for such a query, getNext returns only matched nodes, and
only fully matched nodes are stored on a stack. The getNext filtering function further introduces
an unnecessary node skipping that allows it to possibly skip many nodes in a sequence at the same
time. This feature can be used effectively when sequences are indexed (see Section 4.3).
Another important aspect of each holistic algorithm is output enumeration. Many top-down al-
gorithms use the output enumeration proposed by TwigStack. TwigStack filtering generates root-
to-leaf path query matches called solutions for each root-to-leaf query path in a TPQ. We need to

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:16 R. Bača et al.

Fig. 8. Example of TwigStack’s iterations performed during the query processing of Q8 and Q9 on the XML
tree.

sort the solutions in order to support their efficient output enumeration. An intermediate stor-
age can contain many duplicate nodes, and its size can be even bigger than the size of an input
XML tree in the worst case. Moreover, the enumeration sequentially reads arrays, which results
in unnecessary overhead when many irrelevant nodes are stored in intermediate storage. These
problems are addressed by the bottom-up algorithms discussed in Section 4.2.2.
The basic principles of TwigStack are depicted in Figure 8. Each iteration of TwigStack calls the
filtering function that returns a query node q, and TwigStack stores a head node of the sequence
Tq on the stack Sq if a prefix match exists (note that the head node of Tq has to be weakly matched
due to the filtering function call). Solutions are produced every time we push a node corresponding
to the leaf query node (i.e., 4th , 5th , and 7th iterations) on a stack. Actions performed during each
iteration are the same whether we process Q8 or Q9 because weak match filtering considers every
PC relationship as AD. As a result, the solutions produced in the example are irrelevant for Q9,
which is resolved during enumeration.
The TwigStack algorithm is optimal for a query containing only AD relationships. The weak
match filtering of the algorithm can push useless nodes onto a stack when a query contains a
PC relationship, as demonstrated in the example. As noted in Choi et al. (2003), the optimality
of weak match filtering can be improved in two ways: (1) by multiple scans of a sequence and
(2) by relaxing the data storage model (e.g., using a more refined XML document partitioning, as
described in Section 3.1.1). Choi et al. showed that the cost of the filtering that would be optimal for
any TPQ is too high. Therefore, algorithms improving the optimality of weak match filtering do
that only for a subset of twig queries. TwigStackList (Lu et al. 2004) represents the first instance of
optimality improvement (i.e., it allows multiple scans of nodes in a sequence), and iTwigJoin+PPS
and iTwigJoin+TL (Chen et al. 2005) represent the second instance (i.e., they relax the data storage
model). The iTwigJoin algorithm is an elegant combination of the holistic join and refined XML
document partitioning introduced in Section 3.1.1.
Figure 9 contains the holistic algorithms for improving TwigStack optimality, and they introduce
other TPQ types that are optimally processed. Arrows in Figure 10 represent techniques utilized
to enhance the optimality of each algorithm. The TJFast (Lu et al. 2005a) and TreeMatch (Lu et al.
2010) algorithms are described in Section 4.2.3. In general, the optimality of a holistic algorithm
does not have to be related only to a query type. Bača and Krátký (2009) and Bača et al. (2012) in-
troduced optimality based on XML document features as well. This approach shows that a holistic

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:17

Fig. 9. TPQ types and algorithms that are optimal for their processing.

Fig. 10. Techniques utilized to enhance optimality of holistic algorithms.

algorithm using weak match filtering can be optimal under tag+level or labeled path partitioning if
an XML document does not contain any recursive nodes. They also show that, for many common
XML collections, TwigStack is optimal if labeled path partitioning is used.
4.2.2 Bottom-Up Approaches. As mentioned earlier, the optimality of a holistic algorithm is not
always the main criterion for its efficiency. Another important aspect of a holistic algorithm is fast
output enumeration. All top-down algorithms use TwigStack output enumeration, which is not
very efficient (as mentioned in Section 4.2.1). Another problem of TwigStack output enumeration
is that it returns the result in the form of query matches; they do not reflect information about
output query nodes, and they need additional post-processing to get the output expected by an
XQuery.
These problems are addressed by the Twig2 Stack algorithm (Chen et al. 2006), which ensures a
linear output enumeration with respect to the result size and works with the GTP query model.
Twig2 Stack uses a filtering data structure called a hierarchical stack (HS) and subtree match filtering
that filters out every node that is not matched. That is true regardless of query type. The HS also
serves as an intermediate storage, and, therefore, nodes are pushed onto the HS in postorder. A
mechanism of a node push and the structure of an HS is demonstrated in Figure 11(a). The example
contains two situations: before the push of node a 1 and after the push of node a 1 . The enumeration
algorithm starts with an HS corresponding to the root query node, and only nodes relevant to the
result are accessed during output enumeration.
Compared to the top-down approaches, Twig2 Stack has the following two disadvantages: (1) its
subtree match filtering accesses nodes in an HS that are descendants of the currently evaluated
node, and (2) before the node is pushed onto an HS, the algorithm does not check its prefix query
match (i.e., prefix match filtering is missing). The first disadvantage means that the space complex-
ity of the filtering is linear with respect to the XML tree size, and Twig2 Stack can spend additional
time merging many nodes during the push onto an HS. The second disadvantage is that it pushes
many useless nodes onto HSs. For example, all nodes corresponding to leaf query nodes are pushed.

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:18 R. Bača et al.

Fig. 11. (a) Example of Twig2 Stack’s hierarchical stacks created during the query processing of the TPQ on
the XML tree. (b) TwigList’s lists.

Therefore, Twig2 Stack proposed an improvement where the PathStack’s prefix match filtering is
applied before a node is pushed onto an HS. However, many useless nodes are still pushed onto
an HS (see PathStack in Section 4.2.1).
The first disadvantage of Twig2 Stack is improved by the TwigList algorithm (Qin et al. 2007) by
using an approach where an HS is replaced by a simple array. TwigList needs only the last node
inserted into each array during the subtree match filtering; another advantage is that the array
can be easily paged on a secondary storage if needed. Nodes in the array are stored in postorder,
which means that the result has to be subsequently postprocessed when we require a result pre-
sented according to document order. When a node nq is appended to the end of the array, it is
also associated with a pointer to its first and last descendants corresponding to each child of q.
We show the arrays of TwigList in Figure 11(b). The arrays are much simpler than HSs, which
simplifies their maintenance and allows persistence. One disadvantage of TwigList’s intermediate
storage is the suboptimal handling of PC edges, leading to the fact that TwigList stores even more
irrelevant nodes in intermediate storage than Twig2 Stack. However, this behavior is simply cor-
rected in Grimsmo et al. (2010) by the introduction of level split vectors that represent an improved
TwigList’s intermediate storage.

4.2.3 Combined Approaches. The main disadvantage of the bottom-up approaches, when com-
pared to top-down approaches, is represented by insufficient filtering before a node is inserted into
intermediate storage. A possible solution is a combination of top-down filtering and bottom-up
output enumeration. Basically, Twig2 Stack+PathStack, introduced in Chen et al. (2006), represents
such a solution; however, PathStack uses only prefix path filtering, and the number of irrelevant
nodes in intermediate storage can still be very large. Another combined approach is represented
by the TJStrictPost and TJStrictPre algorithms (Grimsmo et al. 2010).
TJStrictPost uses one global stack for postorder node processing and one stack per query node
for prefix query match filtering. Therefore, each node is stored twice in filtering data struc-
tures. TJStrictPost applies prefix match filtering before a node is inserted into the filtering data
structures. Therefore, the number of nodes inserted into intermediate storage is equal to that of
Twig2 Stack+PathStack. Whether a node is useless or matched is decided when it is popped out
from the global stack, and the matched node is then inserted into intermediate storage. As a result,
nodes are stored in level split vectors in postorder, which means that output enumeration produces
an output that has to be sorted.
TJStrictPre uses a combination of weak match filtering and level split vectors. It introduces an
improved filtering function called getPart. It remembers the last weakly matched node for each
query node returned by the previous getPart call, which allows faster useless node skipping when

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:19

Table 5. Comparison of Holistic Joins

Compatibility
Algorithm Type Filtering Query model category
TwigStack Weak match + Prefix TPQ A
iTwigJoin Weak match + Prefix TPQ A
XTwigStack Top-down Weak match + Prefix X-TPQ A
TwigStackList Weak match + Prefix TPQ B
TJFast Prefix + Weak match TPQ∗ C
Twig2 Stack Subtree match GTP D
Twig2 Stack + Prefix + Subtree match GTP D
PathStack Bottom-up
TwigList Subtree match TPQ D
TJStrictPost Prefix + Subtree match TPQ D
TJStrictPre Weak Match + Prefix TPQ A, D
TreeMatch Combined Prefix + Weak match TPQ >,∗ C
GTPStack Weak match + Prefix + GTP A, D
Subtree match

compared to TwigStack’s getNext filtering function. TJStrictPre stores nodes in its intermediate
storage as soon as they are pushed onto the filtering data structure (a stack of query nodes). There-
fore, nodes in level split vectors are stored in preorder, and the output after output enumeration
does not have to be sorted. However, two passes through level split vectors are needed.
The TreeMatch algorithm (Lu et al. 2010) is an extension of the TJFast algorithm (Lu et al.
2005a) for a larger set of query constructs and with optimizations related to output query nodes.
TreeMatch and TJFast represent a family of algorithms that require a labeling scheme supporting
the LCA operation mentioned in Section 2.3. TreeMatch includes support for TPQ∗, > and semantics
related to boolean expressions. Moreover, TreeMatch applies subtree match filtering. TreeMatch
stores nodes on the ancestor’s stacks until it is certain that the nodes correspond to an output.
This approach avoids the solution sorting used by TJFast and enables TreeMatch to be optimal for
queries having only ADs in nonoutput branching edges.
GTPStack (Bača et al. 2012) is another approach combining top-down and bottom-up filtering.
It is capable of processing a GTP with linear worst-case I/O complexity with respect to the GTP
result size. It further improves the getPart filtering function by using dynamic programming to
avoid unnecessary calls, which reduces the number of holistic algorithm iterations. The number
of iterations performed in Figure 8 would be reduced to six since the first two iterations can be
performed in one getPart call. GTPStack uses level split vectors as intermediate storage for fast
output enumeration.

4.2.4 Comparison of Holistic Approaches. In Table 5, we compare the existing holistic ap-
proaches based on the following criteria:
— Filtering: This criterion shows filtering applied before a node is inserted into intermedi-
ate storage. We consider the worst-case scenario of the algorithm’s filtering. For example,
most top-down algorithms utilize weak match filtering in Table 5; however, when a top-
down algorithm is optimal for a query, then it performs match filtering (i.e., it filters out all
nodes that are irrelevant to a query). A comparison of filterings according to their selectiv-
ity is provided in Figure 12. We use the term “selectivity” to define how the method filters

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:20 R. Bača et al.

Fig. 12. Selectivity of various filterings.

irrelevant nodes (high selectivity means that the method is more successful). The most com-
mon variant of filtering is a combination of weak match filtering (the getNext/getPart
filtering function) and prefix match filtering, while another common filtering is a subtree
match. The weak query match and prefix query match filtering combination is typical for
top-down approaches, and the subtree match filtering is typical of bottom-up approaches.
Note that a more selective filtering does not usually mean faster query processing due to
generated overhead. The major advantage of a more selective filtering is a lower number
of nodes in intermediate storage (often significantly lower). From this perspective, GTP-
Stack achieves the lowest number of nodes in intermediate storage due to its combination
of filterings.
— Query model: This feature specifies the query model supported.
— Compatibility: We assign a compatibility category symbol (A–D) to each holistic approach.
Two algorithms with the same compatibility category symbol can be combined into one
algorithm with minimal effort. Each algorithm is in one of the four categories in Table 5
except for TJStrictPre and GTPStack, which are combinations of algorithms from the A and
D categories; therefore, they can exploit the advantages of both.
It is not easy to compare holistic algorithms according to their real processing time using a
single criterion because each algorithm has its own difficulties. The major difficulty of the top-
down algorithms is their enumeration algorithm. Conversely, the problem of the bottom-up and
combined approaches is that they can store an enormous number of useless nodes in intermediate
storage. From this point of view, combined approaches seem to be closest to an optimal holistic
algorithm from the query processing time point of view.

4.3 Orthogonal Techniques


Last, but not least, we describe techniques that can be quite easily incorporated into any holistic
or binary join.
Partitioning. If a join uses partitioning X (e.g., tag partitioning), it can easily use also parti-
tioning Y , where Y is a decomposition of X (Bača and Krátký 2009). In other words, the selection
of partitioning is typically orthogonal to the selection of a twig join since most joins describe their
functionality with respect to the tag partitioning.
Useless node skipping. Every top-down algorithm performs useless node skipping (i.e., skip-
ping many useless nodes at the same time during an algorithm run) when reading a sequence.
For example, a join has a node n and it needs to skip all nodes in a sequence preceding n. If no
index is available, the useless node skipping is executed during a sequential scan. As mentioned
in Section 3.1.3, we can index sequences (lists) of the partition index using, for example, XB-tree
or XR-tree (Bruno et al. 2002; Jiang et al. 2003a) in order to speed up useless node skipping. These
indices can be used in any join; however, different joins can produce a very different number of
jumps, which is an important parameter for the efficiency of an XB-tree or XR-tree index. In other
words, if a join causes many short jumps, the advantage of such an index is negligible. Some other

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:21

works (Jiang et al. 2003b; Fontoura et al. 2005) describe the optimization of useless node skipping
in a holistic join. The proposed algorithms look for a possibly longest jump that can significantly
decrease the number of I/O accesses as well as node label comparisons.
LCA operation. Section 2.3 depicts the possibility of processing the LCA operation in some la-
beling schemes. TJFast (Lu et al. 2005a), virtual cursor (Yang et al. 2004), TJDewey (Bača and Krátký
2009), and TreeMatch (Lu et al. 2010) utilize the LCA operation to avoid reading sequences corre-
sponding to non-leaf query nodes. Such an approach can avoid reading a lot of nodes; however,
node labels in such a labeling scheme can occupy quite a lot of space by themselves. Therefore,
a node label compression technique should be applied (Härder et al. 2007; Bača et al. 2010). The
advantages of the LCA operation can also be integrated into other algorithms. However, when
compared to an integration of partitioning, LCA operation support needs some changes in an al-
gorithm, and it is not often straightforward. Moreover, the assumption for the use of the LCA
operation and the avoidance of reading a non-leaf query nodes’ sequence is that the labeled path
for each node is accessible during an algorithm run. In practice, it can be difficult to satisfy this
assumption.
Boolean expressions. A number of works deal with TPQs containing OR and NOT con-
structs (Jiao et al. 2005; Jiang et al. 2004; Yu et al. 2006; Che et al. 2012; Izadi et al. 2012; Bača
et al. 2012; Lu et al. 2010). These works avoid a split of a TPQ into more queries. Generally, to sup-
port boolean expressions, we need to normalize a TPQ (Che et al. 2012) and include this support in
node filtering. Once a TPQ is normalized, it is easy to include boolean expressions into a filtering
function (Che et al. 2012) and also into subtree match filtering as well (Bača et al. 2012; Lu et al.
2010).
Data parallelization. There are works dealing with the parallelization of twig query join al-
gorithms (Pan et al. 2008; Machdi et al. 2009, 2010; Tian et al. 2013; Shnaiderman and Shmueli
2015). These works use so-called data partitioning, which means that they partition the sequences,
and each thread works with its partition. Note that this is a different type of partitioning than the
document partitioning depicted in Section 3.1.1. It is easy to integrate data partitioning into any
twig query join, and Shnaiderman and Shmueli (2015) shows that the speed-up can be surprisingly
more than linear with respect to the number of threads because of the ability to skip unnecessary
parts of sequences during partitioning.

5 XML ALGEBRAS
In Section 4, we introduced several algorithms for TPQ processing (i.e., twig query joins). However,
it might not be clear how the TPQ model can be incorporated into XML algebra. In this section,
we describe various XML algebras and their compliance with the TPQ model.
Relational algebra is one of the main components of RDBMSs and plays an important role in
their success and widespread usage. A corresponding XML algebra has the same importance and
plays a substantial role in XML query processing. Many proposals for XML algebras have been
introduced (Jagadish et al. 2001; Chen et al. 2003; Grust et al. 2004; Deutsch et al. 2004; Brantner
et al. 2005; Re et al. 2006; Özcan et al. 2008). In general, existing XML algebras fall into two main
classes depending on the data model they use:

— Tree-based algebra: This class of XML algebras represents XML data as forests of labeled
ordered trees (Jagadish et al. 2001; Al-Khalifa and Jagadish 2002; Chen et al. 2003) and pro-
vides a natural support for the twig query joins introduced in Section 4 and XML-specific
optimizations.
— Tuple-based algebra: This class of XML algebras operates similarly to relational algebra
on sets of tuples (Brantner et al. 2005; Deutsch et al. 2004; Grust et al. 2004; Re et al. 2006;

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:22 R. Bača et al.

Fig. 13. Categories of XML algebras.

Özcan et al. 2008) and extends it to meet the requirements of the XQuery data model. There
are two types of tuple-based algebras, depending on the tuple value allowed: (1) a value
can only be a null, an atomic value, or an XML node (Grust et al. 2004) (see Section 5.2.1);
and (2) a value can also be a sequence of values or a sequence/bag of tuple values (see
Section 5.2.2).
We summarize the basic categories of XML algebras in Figure 13 and describe them in more
detail in the following sections.

5.1 Tree-Based Algebra


TAX (Tree Algebra for XML) (Jagadish et al. 2001; Al-Khalifa and Jagadish 2002; Chen et al. 2003)
is a query algebra developed in the context of the TIMBER project (Jagadish et al. 2002). A TAX
data model is based on an unordered collection of ordered data trees with a similar structure. The
TAX collection is analogous to a relation in relational algebra but is more complex. TAX uses the
TPQ model, and many operators are defined with respect to the model (i.e., they have a TPQ as an
input). All operators have TAX collections on their input as well as on output; therefore, TAX has
the same composability and closure attributes as traditional relational algebra. A big advantage of
TAX is that it naturally supports the concept of TPQ, and, as a result, all algorithms mentioned
in Section 4 can be easily integrated into it. Moreover, it has a small set of operators and all of
them are defined on one domain, which is a TAX collection. The main disadvantage of TAX is that
TAX collections need a native XML query processor written from scratch, and the use of existing
relational query rewritings and optimizations is not fully possible. It is also more difficult to work
efficiently with a TAX collection than with a relation.

5.2 Tuple-Based Algebra


There is a family of algebras that works on tuples (Fiebig et al. 2002; Grust et al. 2004;
Brantner et al. 2005; Re et al. 2006; Manolescu et al. 2009). Every tuple has a homogenous struc-
ture ($a 1 = val 1 , . . . , $am = valm ), where $ai is a variable name, and a value vali can be either a
null value (usually denoted as ⊥) or a node in an XML tree or an atomic value. However, some
algebras (Brantner et al. 2005; Re et al. 2006; Manolescu et al. 2009) also allow a value vali to be
a sequence of values or even a sequence/bag of tuples of values. Tuple-based algebras include all
operators of relational algebra such as selection, projection, join, union, and so on. In addition to
relational operators, XML tuple-based algebras also contain other operators that can be divided
into two categories: (1) those for navigation in an XML document and (2) those for construction
of XPath/XQuery data model values from the tuples.
5.2.1 Basic Algebras. The work of Grust et al. (2004) is the major representative work on tuple-
based algebra where only nulls, atomic values, or XML nodes are allowed in tuple values. It presents
a relational algebraic approach for compiling XQuery expressions and is developed in the context
of the Pathfinder project (Grust et al. 2008). This approach is fully relational, and each XQuery is
translated into an SQL query. Most queries use relatively simple and standard relational operators

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:23

Fig. 14. Tuple-based algebraic operators.

that allow the query optimizer to employ relational algebraic optimization techniques such as
a pushdown of the selection operator. Figure 14 lists the Pathfinder algebraic operators that are
designed to receive one or more inputs and produce one or more outputs. These inputs and outputs
are sets of tuples. Most of the operators perform quite simple and standard relational operations
(e.g., selection, projection, cartesian product, join, union) that allow the query optimizer to employ
the usual relational algebraic optimization techniques (Sakr 2007). For example, the XPath Location
Step Evaluator operator applies the evaluation conditions of an XPath location step over the context
nodes and returns the result as a sequence of tuples corresponding to output nodes.
5.2.2 Nested Algebras. As mentioned earlier, there are some works (Brantner et al. 2005; Re
et al. 2006; Manolescu et al. 2009) that allow nonatomic values. As a result, such an algebra can-
not be integrated into a relational query processor as easily as the basic tuple-based algebras. The
major reason for these extensions of the basic algebras is the possibility of building more efficient
query plans. Brantner et al. (2005) presented NAL algebra (Natix Algebra) for an algebraic com-
pilation of an XPath expression supporting a sequence of tuple values as a value. NAL algebra is
developed in the context of the Natix project (Fiebig et al. 2002), and it introduces support for all
XPath constructs. It can easily incorporate only navigational binary structural joins mentioned in
Section 4.1.2. To overcome this problem, Mathis (2007) introduced an extension of NAL in order to
support the TPQ, and, therefore, all the algorithms described in Section 4. Re et al. (2006) present
an approach that allows a value to be a sequence of values. These sequences are called XML val-
ues, and they correspond to sequences of items in the XQuery data model. The algebra contains
three basic types of operators: (1) operators on XML values, (2) operators on tuples, and (3) oper-
ators allowing XML values to be mapped to tuples and the other way around. One of the newer
works (Manolescu et al. 2009) describes a unified algebra (UA) that is similar to that described in
another work (Re et al. 2006). UA is a combination of operators specific for XML values (i.e., a
sequences of XML items) and relational algebra. UA allows homogeneous sequences of tuples or
bags of tuples in a value.
All XML algebras mentioned in this section claim that it is easier to identify and apply promising
optimization techniques such as group by on complicated and nested XQueries. Another advantage

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:24 R. Bača et al.

Table 6. Comparison of the XML Algebras

Twig query Relational


Name of algebra Language joins compliance Major advantages
TAX (Jagadish et al. 2001) AD-XQuery all ✗ A simple algebra with few op-
erators.
GTP (Chen et al. 2003) AD-XQuery all ✗ A query model (GTP) cov-
ering all major XQuery con-
structs.
Algebra of (Grust et al. 2004) XQuery Nav BJ  An RDBMS is sufficient for its
run. No twig query joins are
needed (however, they can be
integrated).
NAL (Brantner et al. 2005) XPath BJ ✗ A natural support of item
NAL+ (Mathis 2007) XPath all ✗ sequences which enables an
Galax (Re et al. 2006) XQuery BJ ✗ algebraic optimization on
UA (Manolescu et al. 2009) XQuery all ✗ complicated nested queries.

is that we can incorporate all TPQ algorithms introduced in Section 4; we only need to identify a
TPQ in a query plan (Michiels et al. 2007).

5.3 Comparison of XML Algebras


All existing algebras have the same goal: to formalize query processing into a set of elementary
operators that subsequently simplify the cost-based optimization of a query plan. Table 6 compares
the algebras based on the following criteria:

— Language: This criterion specifies whether the algebra supports XPath or XQuery. More-
over, there are algebras supporting XQuery restricted only to the AD and PC relationships
(called AD-XQuery).
— Twig query joins: We analyze what kind of twig query joins the algebra supports. We use
the abbreviations BJ and Nav BJ for a binary join and a navigational binary join, respectively
(see Sestion 4.1). The “all” shortcut means that both twig query join types (i.e. binary and
holistic) are supported.
— Relational compliance: We analyze whether an algebra uses a data model fully compatible
with the relational data model or not. In other words, this criteria is just to determine if it is
possible to map the XML algebra operators on the relational operators without substantial
work.
— Major advantages: We highlight the major advantages of each approach.

As shown in Table 6, UA is the only algebra supporting XQuery and all the twig query joins.
However, if we apply a TPQ identification in a query plan (Michiels et al. 2007), then holistic twig
joins can be used in Galax algebra as well.

6 XML QUERY PROCESSING OPTIMIZATIONS


In this section, we discuss various optimization techniques and selectivity estimation methods that
are specific for structural XML query processing.

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:25

Table 7. Categories of Twig Query Joins

Algorithm type
Merge-like Navigational
Binary BM (StackTree (Al-Khalifa et al. 2002)) B N (StairCase (Grust et al. 2003))
joins
Holistic H M (TwigStack (Bruno et al. 2002)) H M N (CostTwigJoin (Bača et al. 2015))
joins H M N (CostTwigJoin (Bača et al. 2015))

6.1 Cost-Based Optimizations


There are two basic levels where XML query plan optimization occurs: (1) a logical level, where
logical query plan rewriting is performed (i.e., unnesting (Astrahan and Chamberlin 1975)), and
(2) a physical level, where an appropriate access method is selected.

6.1.1 Logical Level Optimizations. XML algebras usually describe a set of possible rewritings
and promising query plan transformations (Re et al. 2006; May et al. 2004). The most significant
type of rewriting is unnesting. This is a well-known in RDBMSs that typically rewrites a depen-
dent nested query into an independent one which can subsequently lead to a significant query
processing speed-up. Wu et al. (2003) introduced dynamic programming algorithms searching for
an optimal binary structural join order. This approach is based on the fact that we can process
a TPQ though many equivalent query plans consisting of binary structural joins and sort opera-
tors. Che et al. (2006) introduced a set of algebraic XPath transformations for query optimizations
that are based on an extended PAT algebra (Salminen and Tompa 1994) originally developed for
search on semi-structured text. The algebraic transformations are based on an XML schema or
XML document properties and on the existence of the so-called structural index. The term “struc-
tural index” is used by Che et al. (2006) to describe a materialized join between a pair of XML
elements in an XML schema. However, structural index selection is not mentioned in the same
study even though it can be crucial to the technique’s applicability.

6.1.2 Physical Level Optimizations. Cost-based optimizations at the physical level focus mainly
on a selection between navigational and merge-like algorithms (see Section 4.1). However, a twig
query join (structural binary or holistic) selection is also important. Table 7 summarizes the basic
categories of twig query joins from the cost-based perspective, and for each category it gives one
typical algorithm.
Halverson, Burger, Galanis, Kini, Krishnamurthy, Rao, Tian, Viglas, Wang, Naughton, et al.
(2003), Zhang et al. (2005), and Georgiadis et al. (2010) introduced XML systems that are capable
of using both types of binary structural joins (B M and B N in Table 7) during the query processing.
Each XML system builds a query plan consisting of a set of binary structural joins, where an
appropriate type of join is selected according to the selectivity. All these works focus on XPath
processing, and they are based on a simple observation that a navigational algorithm is superior to
a merge-like algorithm if the percentage of useful nodes from an input sequence is low. This cost-
based optimization is similar to the selection of an appropriate equi-join algorithm in RDBMSs,
where we select between the merge join and the nested-loop join. Whereas the work by Halverson
et al. (2003) is one of the first in this field, Zhang et al. (2005) uses a cost-based model based on a
statistical regression. Georgiadis et al. (2010) introduced a data structure API similar to that defined
in Table 2. The API allows both types of indices to be abstracted (i.e., document and partition
indices depicted in Section 3.1) to a set of simple operations, and they evaluate several various
index configurations and physical operators.

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:26 R. Bača et al.

Table 8. A Comparison of Cost-based Approaches

Categories of twig
Approach join algorithms Major advantages
(Wu et al. 2003) BM Resolves optimal binary structural
join ordering
(Halverson et al. 2003) BM BN Baseline for other approaches
combining B M and B N
(Zhang et al. 2005) BM BN Adaptive to a query workload
(Georgiadis et al. 2010) BM BN Contains abstraction of basic
operations which makes the solution
general
(Weiner and Härder 2010) BM HM Integrates holistic joins into query
plans
(Bača et al. 2015) HM N Integrates navigational query
processing into a holistic join

Weiner and Härder (2009, 2010) considered the advantages of a combination of binary and
holistic joins. In Weiner and Härder (2009), we can find one of few direct comparisons between
binary and holistic joins. The authors search for a query selectivity threshold when a holistic
join becomes more efficient than a plan with binary structural joins. Their results indicate that
highly selective queries are performed more efficiently using holistic joins. Weiner and Härder
(2010) described the approach for using this knowledge to build a query plan combining binary
and holistic structural joins. Bača et al. (2015) incorporated the navigational query processing into
a holistic join. In this way, the search space of possible plans is reduced, and so an optimal query
plan selection can be simplified when compared to approaches based on binary joins.
Since there are many various XML indices, the authors of Elghandour et al. (2013) presented a
design advisor for recommending the indices and materialized views.
6.1.3 Comparison of Cost-Based Optimizations. Table 8 lists the major cost-based approaches
and shows the categories of join algorithms used by these approaches. In general, many approaches
are based on the idea that navigational algorithms are faster than merge-like algorithms if the
selectivity is high (Halverson et al. 2003; Zhang et al. 2005; Georgiadis et al. 2010). Other ap-
proaches (Weiner and Härder 2010; Bača et al. 2015) take into account the cost of holistic joins
in their cost-based optimization methods.

6.2 Selectivity Estimation Techniques


Estimating the size of a query is a crucial part of any effective query optimization method. The
selectivity estimation problem in the XML domain is more complicated than that in the relational
domain. There are several reasons behind this complexity:
— The absence of a strict schema in the XML data
— The dualism between structural and value-based querying
— The high expressiveness of the XML query languages
— The nonuniform distribution of tags and data
— The correlation and dependencies between occurrences of nodes
In general, the selectivity estimation of TPQs is equivalent to the selectivity estimation of a
relational join on several tables with selection predicates, which is recognized as a difficult

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:27

problem (Polyzotis et al. 2004a). Several research efforts proposed various selectivity estimation
techniques in the XML domain (Aboulnaga et al. 2001; Polyzotis and Garofalakis 2006b; Zhang
et al. 2006; Fisher and Maneth 2007; Polyzotis and Garofalakis 2006a; Teubner et al. 2008). These
techniques can be broadly classified into four main classes in terms of a structure used for
collecting the summary information:

— Synopsis-based estimation techniques: This class uses a tree or graph to represent the
summary information of an XML document (Aboulnaga et al. 2001; Lim et al. 2002; Polyzotis
and Garofalakis 2002, 2006b; Zhang et al. 2006; Polyzotis and Garofalakis 2006a; Fisher and
Maneth 2007).
— Histogram-based estimation techniques: This class uses statistical histograms to cap-
ture the summary information of an XML document (Freire et al. 2002; Wu et al. 2002; Li
et al. 2006).
— Sample-based estimation techniques: These approaches derive a sample from an XML
document, and this sample is subsequently used for the selectivity estimation (Luo et al.
2009).
— Algebra-based estimation techniques: This class uses the existing machinery from the
relational domain (Sakr 2008; Teubner et al. 2008).

6.3 Synopsis-Based Estimation Techniques


The work by Aboulnaga et al. (2001) is considered to be the first to deal with a selectivity estima-
tion of simple path expressions. It presented two different techniques. The first technique uses a
schema tree structure (see Section 3.1.2) extended with node counts. Figure 15(a) and (b) illustrate
an example of an XML tree and its extended schema tree, respectively. The second technique is
a statistical structure called a Markov table. This table, implemented as a hash table, contains all
distinct labeled paths (of the length up to m) together with their frequencies. It has an acceptable
estimation error for XML documents with a regular structure.
XPathLearner (Lim et al. 2002) has been presented as a selectivity estimation of TPQs which
employs the same techniques presented in Aboulnaga et al. (2001), only with some modifications.
The first difference in XPathLearner is that, instead of scanning an XML document, it gathers
and refines its statistical structure in an online manner from the query result (i.e., it is workload-
aware). The second difference is that it supports handling of value predicates by storing statistical
information for each distinct tag-value pair in an XML document. PathTree (Alrammal et al. 2011)
is another approach based on the same schema tree principles of Aboulnaga et al. (2001). It has
been extended to support the TPQ model and describes an algorithm for efficient updates. How-
ever, the PathTree selectivity estimation algorithm is not depicted very clearly (the algorithm is
explained using an example with many inaccuracies), and the experiments lack a description of
testing queries or at least their features.
Mohammed et al. (2015), El-Alfy et al. (2015), and Mohammed et al. (2016) used a synopsis tree
that bears certain similarities to F&B partitioning. A node equivalence is defined by a complex
hash function by taking into consideration the descendants and ancestors of a node. As a result,
the synopsis tree is more compact when compared to F&B, and the reported estimation accuracy
is high (an error never exceeds 5% for tested queries); however, the space budget for the TreeBank
collection (University of Washington Database Group 2002) (87 MB in size), with its highly recur-
sive and irregular structure, is unacceptably large (i.e., more than 10% of the original document
size).
There exists a group of similar synopsis-based estimation techniques (Polyzotis and Garofalakis
2002, 2006a; Polyzotis et al. 2004b; Zhang et al. 2006) having a highly accurate selectivity

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:28 R. Bača et al.

Fig. 15. (a) XML tree. (b) Schema tree with corresponding node counts. (c) XSeed kernel. (d) Binary repre-
sentation of XML document and corresponding SLT grammar. (e) Mapping tables of Li et al. (2006).

estimation (i.e., they provide a low estimation error) when compared to most other techniques.
They first create a compact graph statistical structure called a kernel from an XML document, sub-
sequently updating this kernel according to a query workload. TreeSketch (Polyzotis et al. 2004b)
is the earliest technique of the group. It provides not only a selectivity estimation but also an ap-
proximate result of an TPQ. The XSeed technique (Zhang et al. 2006) addresses mainly recursive
XML documents and recursive queries. Both techniques handle only TPQs without value predi-
cates. The main advantages of both techniques is the small size of a statistical structure and a low
estimation error. On the other hand, they both report a long build time of the TreeBank collection.
For example, Luo et al. (2009) reported on the inability to create the estimation structure for Tree-
Bank within days for both techniques (i.e., TreeSketch and XSeed). The XSeed statistical structure
is constructed by starting with a small kernel having a size and shape similar to the schema tree
of an XML document; this is then incrementally updated through the feedback of queries. The
kernel as such is represented in the form of a label-split graph summary structure proposed by
Polyzotis and Garofalakis (2002). In this graph, each edge e = (u, v) is labeled with a vector of
integer pairs (p0 :c 0 , p1 :c 1 , . . . , pn :c n ), where the ith integer pair (pi :c i ) indicates that, at recur-
sion level i, there are pi and c i elements mapped to the synopsis vertices u and v, respectively.
Figure 15(c) illustrates an example of the XSeed kernel for the XML tree in Figure 15(a).
In Fisher and Maneth (2007), a Straight-Line Tree (SLT) grammar statistical structure is proposed.
This work uses the well-known idea of removing repeated patterns in a tree by removing multiple
occurrences of equal (or similar) subtrees and replacing them with pointers to a single occurrence
of the subtree. The synopsis is constructed by using a tree compression algorithm to generate a
minimum unique Directed Acyclic Graph (DAG) of a binary XML tree (see example in Figure 15(d)),

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:29

and then represents resulting DAG structures using a lossy SLT grammar. Figure 15(d) illustrates
an example of the binary XML tree corresponding to the XML tree in Figure 15(a) and the cor-
responding SLT grammar. An estimation algorithm is designed for a tree automaton to run over
the lossy SLT grammar. XML queries are converted into its equivalent tree automaton and evalu-
ated. The proposed synopsis of this work supports the estimation of all XPath axes, including the
order-sensitive ones, using a memory space efficiently; however, it can deal only with simple path
expressions.
XCluster (Polyzotis and Garofalakis 2006b) is considered a generalized form of the XSketch tree
synopsis, which is a previous work of Polyzotis and Garofalakis (2002). However, the XCluster
synopsis captures correlations also between a structure and values. The approach uses histograms
to approximate the value distribution. The build time of the XCluster synopsis is not included in
the paper; however, since XSketch has a build time of within a few days for TreeBank (Li et al.
2006), we can expect similar properties for XCluster as well.

6.4 Histogram-Based Estimation Techniques


StatiX (Freire et al. 2002) is a selectivity estimation that leverages the available information in an
XML schema to capture both structural and value statistics about XML documents. These struc-
tural and value statistics are collected in the form of histograms. The StatiX system is employed
in LegoDB (Bohannon et al. 2002) (see Section 3.2.2). The StatiX system consists of two main com-
ponents. The first component is an XML schema validator that simultaneously validates an XML
document against its schema and creates its statistical structure. It assigns globally unique identi-
fiers (IDs) to all instances of types defined in the XML schema. Using these assigned IDs, struc tural
histograms are constructed to summarize information about connected edges. Value histograms
are constructed for types defined in terms of base types, such as integers. The second component is
an XML schema transformer enabling the statistical structure to have different levels of granularity.
Wu et al. (2002) presented an approach for selectivity estimation that relies on the Containment
labeling scheme presented in Section 2.3. For each basic predicate P, a two-dimensional histogram
summary data structure is built and named a position histogram. In position histograms, start and
end values correspond to the x and y axes, respectively. Each grid cell in the histogram represents
the number of nodes satisfying the conditions of the predicate P and having start and end positions
within the specified ranges of the grid cell. This selectivity estimation technique has several limita-
tions: (1) only AD relationships are considered, therefore, PC relationships are handled as AD; (2)
the number of constructed position histograms is proportional to the number of predicates, which
results in limited scalability (Freire et al. 2002); and (3) it lacks any support for updates (Zhang
et al. 2006).
Li et al. (2006) described a framework for estimating the selectivity of XPath expressions
with the main focus on the order-based axes (following, preceding, following-sibling, preceding-
sibling). The approach works with two mapping tables, two frequency tables, and two approxima-
tion structures. The mapping tables are (1) an encoding table assigning an integer to each distinct
root-to-leaf path and (2) a path id table assigning an integer to each node type, where a node type
has a unique bit sequence specifying a set of root-to-leaf paths corresponding to the node type.
Figure 15(e) depicts the mapping tables corresponding to the XML tree in Figure 15(a). The fre-
quency tables are (1) a pathId frequency table storing frequencies of node types in an XML doc-
ument and (2) a path-order table capturing the sibling order information based on node types.
Since these frequency tables are quite large in practice, Li et al. (2006) presented an approach to
approximate them using p-histogram and o-histogram structures (approximating the pathId fre-
quency table and the path-order table, respectively). Li et al. (2006) experimentally compared their
work with XSketch (Polyzotis and Garofalakis 2006a) and showed a comparable estimation error

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:30 R. Bača et al.

while having comparable statistical structure size and significantly lower build time. Nevertheless,
the storage requirements for the path id table grow slightly faster than linearly with a number of
root-to-leaf paths, which is not feasible for less regular XML documents. In the case of TreeBank,
the path id table occupies megabytes of storage space since it has several tens of thousands of
root-to-leaf paths, whereas the statistical structures usually only occupy a few kilobytes.

6.5 Sample-Based Estimation Techniques


Luo et al. (2009) presented a sampling-based technique that creates a sample by randomly remov-
ing subtrees corresponding to a tag name if the number of subtrees of the tag is high. Despite the
simplicity of this approach, the authors report an estimation error that is comparable to XSeed
and TreeSketch while having a comparable space budget and a significantly lower build time. As
usual in sampling-based methods, it can have problems with skewed data; on the other hand, it can
easily support updates, can be built for irregular XML data collections in a short time, and it also
support complex XQuery queries. Another advantage is that we can perform a query on a sample
and thus also have the ability to observe some interesting parameters for a cost-based optimiza-
tion that would be quite difficult to obtain in another way (e.g., the number of node comparisons
during the algorithm run (Bača et al. 2015)).

6.6 Algebra-Based Estimation Techniques


The design and implementation of a relational algebra–based framework for selectivity estimation
is described in Sakr (2008) and Teubner et al. (2008). The main idea of these approaches is to use
existing machinery from the relational domain for a selectivity estimation. In particular, XML
queries are translated into relational algebra plans (Grust et al. 2004). A peephole-style analysis of
these relational plans is performed to annotate each operator with a set of special properties (Grust
2005). These annotations are produced during a single pass over a relational plan and use a set of
lightweight inference rules that are local in the sense that they only depend on the operator’s
plan inputs. These approaches can integrate any selectivity estimation technique mentioned in
the previous sections.

6.7 Comparison of Selectivity Estimation Techniques


Table 9 compares the selectivity estimation techniques mentioned in the previous sections. We
compare these techniques in terms of:
— Language (Lang.): We resolve what part of the XPath/XQuery languages the technique
covers. Except the self-describing TPQ, TPQ >> , and TPQ∗ , we mention also a “path,” which
is an abbreviation for simple path expressions.
— XML schema (Sch.): We detect whether the technique can work with schema-less XML
documents.
— Updatability (Upd.): We discuss whether it is possible to simply accommodate updates to
an XML document.
— Adaptivity (Adap.): We study whether the technique is adaptive with respect to the work-
load.
—Values (Val.): We analyze whether the technique considers the selectivity estimation of
expressions containing values (i.e., whether the technique is capable of capturing some cor-
relation between the structure and the node content).
— Build time: Unfortunately, sizes of XML collections used in experiments differ, therefore,
their build times are not directly comparable. Therefore, we consider the build time “low” if

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:31

Table 9. Comparison of the Selectivity Estimation Techniques


Build
Technique Lang. Sch. Upd. Adap. Val. time Features
(Aboulnaga et al. Path   ✗ ✗ low A simple approach having an
2001) acceptable estimation error for
DBLP.
(Alrammal et al. TPQ∗    ✗ low Extension of Aboulnaga et al.
2011) (2001) to support TPQ∗ and
updates. Vague algorithm
description and experiments.
(Mohammed et al. TPQ∗  ✗ ✗ ✗ med A large statistical structure for
2015, 2016) irregular XML documents. High
accuracy.
(Lim et al. 2002) Path    ✗ med. Its statistical structure is built
from the query output not from
an XML, therefore, it is not
suitable for ad-hoc queries.
(Polyzotis et al. TPQ  ✗  ✗ high High accuracy.
2004b)
(Zhang et al. 2006) Path∗  ✗  ✗ high Higher accuracy than Polyzotis
et al. (2004b) for recursive
queries.
(Fisher and Maneth Path   ✗ ✗ low Supports all XPath axes.
2007)
(Polyzotis and TPQ  ✗   high Enables even full-text selectiviy
Garofalakis 2006b) estimation.
(Freire et al. 2002) TPQ ✗    med. Quality of an estimation error is
dependent on the distribution of
schema types.
(Wu et al. 2002) TPQ  ✗ ✗ ✗ uknown The space complexity for ad-hoc
queries can be quite high.
(Li et al. 2006) TPQ > >  ✗ ✗ ✗ med. Large space needs for XML
collections with a lot of
root-to-leaf paths.
(Luo et al. 2009) TPQ   ✗  low Lower accuracy for a skewed
data but applicable even for
TreeBank.

experiments mention the build time (the experimental build time) of 10MB DBLP or XMark
collections within seconds, or if the build method has clearly linear complexity with respect
to XML document size. We consider the build time “medium” (“med.”) if the experimental
build time of 10MB DBLP or XMark collections is within minutes, if the build method has
clearly polynomial complexity, or if accurate results need iterative learning for accurate
predictions. We consider the build time “high” if the experimental build time of 10MB DBLP
or XMark collections is within hours or if the TreeBank collection cannot be built within a
day. We always consider the worst build time. Note that in the case of Wu et al. (2002) there
is not enough data to specify its build time.
In Table 9, we did not include algebra-based techniques (Sakr 2008; Teubner et al. 2008) since
these high-level selectivity estimation techniques can support any TPQ selectivity estimation tech-

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:32 R. Bača et al.

nique. We also did not consider the accuracy of any selectivity estimation since the comparison of
accuracy is even more fuzzy than the comparison of build time. In general, techniques that have
been constantly reporting a low estimation error (Li et al. 2006; Luo et al. 2009) are kernel-based
techniques such as TreeSketch (Polyzotis et al. 2004b) or XSeed (Zhang et al. 2006). Kernel-based
techniques try to avoid join independence assumptions in their synopsis as much as possible,
which leads to high accuracy. On the other hand, their build time is long and it dramatically in-
creases with irregularity of an XML document structure.
In principle, not many techniques capture a correlation between the structure and the node
content. For example, the accuracy of the sampling-based technique (Luo et al. 2009) is highly
dependent on the size of a sample and the actual value distribution. In StatiX (Freire et al. 2002),
relational value histograms are used to make the selectivity estimation of XML queries with values
more accurate. On the other hand, the necessity of an XML schema can be a limitation. XClus-
ter (Polyzotis and Garofalakis 2006b) is a combination of a highly accurate kernel approach with
value histograms; however, the build time can be a limitation for some XML collections.
As can be observed, there is no single technique that is appropriate for all types of XML col-
lections and XML queries. Therefore, we believe that an XDBMS should include at least two tech-
niques that complement each other for an accurate selectivity estimation, with one of the tech-
niques using a kernel (Polyzotis and Garofalakis 2002, 2006a; Polyzotis et al. 2004b; Zhang et al.
2006; Polyzotis and Garofalakis 2006b) since it provides high accuracy and an acceptable build time
for regular XML documents, and another technique using the sampling-based approach (Luo et al.
2009) since it has a low build time that is useful for irregular XML documents, and it supports
XQuery. If the necessity of an XML schema is not a limitation, StatiX (Freire et al. 2002) is also a
good choice that does not need to be complemented by any other technique.

7 CONCLUSION AND FUTURE WORK


In this article, we focused on the various structural querying aspects necessary for the design and
development of an XML query optimizer and the storage of XML data. Although we consider TPQ
as a major structural query model, we also take GTP into account since it includes other features
of XQuery not covered by TPQ (e.g., output query nodes). We specifically reviewed methods of
XML query processing and XML data handling in a DBMS. We showed the continuity of methods
of different XML processing areas, and we placed additional emphasis on a thorough comparison
of all methods of each area.
To review different processing of XML queries in DBMSs, we described the following:

— XML algebras: All existing query algebras have the same goal: to formalize query processing
into a set of elementary operators that will subsequently simplify a query plan cost-based
optimization. We distinguish that only unified algebra supports XQuery and all twig query
joins (binary as well as holistic joins).
— Twig query joins: We have shown that at some points of processing it is more efficient to use
a navigational algorithm instead of a merge-like algorithm when binary structural joins are
used. We also showed that holistic joins overcome the problem of possibly large interme-
diate results produced by binary structural joins. We abstracted the features of twig query
joins from the data structures, and we described their advantages mainly with respect to
the ability to efficiently filter input sequences. Although there are many holistic join algo-
rithms with various features, combined approaches seem to be closest to an optimal holistic
algorithm from the filtering capability point of view. Moreover, we also showed a list of
techniques that can be easily integrated into most of the twig query joins.

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:33

— XML query processing optimizations:


— Cost-based optimizations: Whereas logical-level optimization techniques perform logi-
cal query plan rewritings (e.g., they search for an optimal binary structural join order),
physical-level optimization techniques select an appropriate access method (e.g., they fo-
cus on a selection between navigational and merge-like algorithms as well as between
binary structural and holistic join algorithms).
— Selectivity estimation techniques: Cost-based optimization techniques often need to esti-
mate query selectivity. There is no single technique that is appropriate for all types of
XML collections and XML queries. Therefore, we believe that an XDBMS should include
at least two complementary techniques for accurate selectivity estimation. The first tech-
nique can be one using a kernel (since they have a high accuracy and an acceptable build
time for regular XML documents), the second technique can be sample-based (since it has
low build time, can be used for irregular XML documents, and supports XQuery).
To analyze handling of XML data in DBMSs, we depict:
— Labeling schemes: We conducted a thorough comparison of the most significant labeling
schemes in terms of basic operations, update cost, and storage cost. We showed that there
is no single best labeling scheme; therefore, we selected according to our preferences related
to the update cost, basic operation cost, and support of the LCA operation (since the other
operations are supported in all labeling schemes).
— Partitioning of XML document: Although tag partitioning is the most common partitioning,
we showed that partitioning with a high number of partition labels can speed up query
processing. For example, F&B partitioning allows us to process a TPQ without AD edges by
a single random access to a node index, and labeled path partitioning resolves a simple path
expression without any AD edges in the same way. On the other hand, the TPQ matching
problem is shifted to the problem of finding all partition labels in a schema tree, which is
an identical problem.
— There are two basic types of node indices: The document index (having a node id as a key)
and the partition index (having a partition label as a key). The partition index can be im-
proved by an index like XB-tree or XR-tree. Whereas the merge-like binary structural and
holistic joins typically utilize the partition index, navigation binary structural joins utilize
the document index.
—Methods of shredding XML data into relational tables are also described since there are good
reasons to use a relational query processor to evaluate XML queries in specific use cases.
Even though structural XML query processing is a well-researched area, there are still im-
portant problems:
— Update operations and concurrency control: Transactions represent a cornerstone concept of
every database, and the way they are implemented can have a significant impact on the
whole performance of a database system. Nevertheless, few research efforts have focused
on analyzing the performance of the proposed structural indexing mechanisms in dynamic
and concurrent environments. The TPoX benchmark (Nicola et al. 2007) is considered the
first step in this direction; however, careful evaluation of structural XML query processing
mechanisms with respect to such types of workload is still missing.
— Comparison of the performance of join operations: There is a lack of a thorough comparison
between holistic and binary join operations. In principle, holistic joins are often recognized
as a generalization of binary joins; however, this is inaccurate. In particular, if we use fully
pipelined query plans (plans where one binary join does not wait for the complete result of

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:34 R. Bača et al.

the previous one (Wu et al. 2003)), the performance of query processing for these plans will
have very similar characteristics to the plans of holistic joins, particularly when the query
has a low number of output query nodes.
— Utilization of semantic information of querying constructs: Exploitation of the semantics of
the output query nodes (Lu et al. 2010) and theoretical properties of the join operations (Bača
et al. 2012) in the query model can significantly improve the performance of the query plans.
We expect that similar improvement can be achieved by leveraging the semantics of some
other constructs (e.g., count function or position specification) in the optimization of a query
plan.
— Emerging database platforms: There is a lack of analysis of the performance of XML indexing
mechanisms and join algorithms with respect to emerging database platforms, such as main
memory database systems, highly parallel environments, or platforms that are designed to
combine various types of workloads (Leis et al. 2013; Schuh et al. 2016). Consideration these
emerging platforms in the context of structural XML query processing can generate original
and highly competitive solutions.
— Adoption of structural XML querying techniques for querying emerging data exchange for-
mats: Recently, we have witnessed the emergence of various data exchange formats such
as JSON and RDF. These formats have similar structure to XML. Many ideas for structural
XML can be utilized for efficient querying of JSON and RDF data. For example, containment
labeling in graph vertex reachability queries has been utilized in Trißl and Leser (2007). We
believe that other techniques of structural XML query processing can be useful as well.
— Querying across multiple data formats: Semi-structured data formats (e.g., XML and JSON)
are currently widely utilized in the area of multimodel databases (Lu and Holubová 2017)
in combination with other data formats (e.g., graphs) (Sakr and Pardede 2011). For example,
ArangoDB18 combines key/value pairs, JSON documents, and graph data. Or, MarkLogic19
supports a combination of distinct data models, such as XML, JSON, or RDF. Efficient evalu-
ation of cross-model queries is critical, and exploitation/extension of the described methods
is a promising strategy.
In general, XML querying involves a wide spectrum of aspects. In this article, we focused on
structural XML query processing. Other survey papers and research efforts have considered other
important aspects, such as the keyword search on XML data (Liu and Chen 2011; Zhou et al. 2016;
Agarwal et al. 2016), skyline queries (Cohen and Shiloach 2009), approximate queries (Polyzo-
tis et al. 2004b; Liu and Yan 2016), XML automata (Schwentick 2007), and querying compressed
XML formats (Maneth et al. 2008; Sakr 2009). In addition, with the widespread adoption of the Big
Data phenomena (Sakr and Gaber 2014; Sakr 2016) and the continuous increase of web size and
XML data volumes (Lehner and Sattler 2013), several distributed and parallel XML query process-
ing techniques have been proposed to tackle this challenge (Abiteboul et al. 2006; Cavendish and
Candan 2008; Abiteboul et al. 2008; Figueiredo et al. 2010; Camacho-Rodríguez et al. 2012; Bidoit
et al. 2013; Camacho-Rodríguez et al. 2015; Hammoud et al. 2015). We expect to see increasing
research activities in this direction.

REFERENCES
Serge Abiteboul, Ioana Manolescu, Neoklis Polyzotis, Nicoleta Preda, and Chong Sun. 2008. XML processing in DHT net-
works. In IEEE 24th International Conference on Data Engineering. IEEE, 606–615.

18 https://www.arangodb.com/.
19 http://www.marklogic.com/.

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:35

Serge Abiteboul, Ioana Manolescu, and Emanuel Taropa. 2006. A framework for distributed XML data management. In
International Conference on Extending Database Technology. Springer, 1049–1058.
Ashraf Aboulnaga, Alaa R. Alameldeen, and Jeffrey F. Naughton. 2001. Estimating the selectivity of XML path expressions
for internet scale applications. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB).
591–600.
Manoj K. Agarwal, Krithi Ramamritham, and Prashant Agarwal. 2016. Generic keyword search over XML data. In Proceed-
ings of the 19th International Conference on Extending Database Technology (EDBT’16). 149–160. DOI:http://dx.doi.org/
10.5441/002/edbt.2016.16
Shurug Al-Khalifa and H. V. Jagadish. 2002. Multi-level operator combination in XML query processing. In Proceedings of
the 2002 ACM CIKM International Conference on Information and Knowledge Management (CIKM). 134–141.
Shurug Al-Khalifa, H. V. Jagadish, Jignesh M. Patel, Yuqing Wu, Nick Koudas, and Divesh Srivastava. 2002. Structural
joins: A primitive for efficient XML query pattern matching. In Proceedings of the 18th International Conference on Data
Engineering (ICDE). 141–152.
Norah Saleh Alghamdi, Wenny Rahayu, and Eric Pardede. 2014. Semantic-based structural and content indexing for
the efficient retrieval of queries over large XML data repositories. Future Generation Computer Systems 37 (2014),
212–231.
Muath Alrammal, Gaetan Hains, and Mohamed Zergaoui. 2011. Path tree: Document synopsis for xpath query selectiv-
ity estimation. In 2011 International Conference on Complex, Intelligent and Software Intensive Systems (CISIS). IEEE,
321–328.
S. Amer-Yahia. 2003. Storage Techniques and Mapping Schemas for XML. Technical Report TD-5P4L7B, AT&T Labs-Research.
Morton M. Astrahan and Donald D. Chamberlin. 1975. Implementation of a structured english query language. Communi-
cations of the ACM 18, 10 (1975), 580–588.
R. Bača and M. Krátký. 2009. On the efficiency of a prefix path holistic algorithm. In Database and XML Technologies (Lecture
Notes in Computer Science), Vol. 5679. Springer–Verlag, 25–32.
Radim Bača, Michal Krátký, Tok Wang Ling, and Jiaheng Lu. 2012. Optimal and efficient generalized twig pattern process-
ing: A combination of preorder and postorder filterings. The VLDB Journal (2012), 1–25.
Radim Bača, Petr Lukáš, and Michal Krátký. 2015. Cost-based holistic twig joins. Information Systems 52 (2015), 21–33.
A. Balmin and Y. Papakonstantinou. 2005. Storing and querying XML data using denormalized relational databases. The
VLDB Journal 14, 1 (2005), 30–49.
Roger Bamford, Vinayak Borkar, Matthias Brantner, Peter M. Fischer, Daniela Florescu, David Graf, Donald Kossmann,
Tim Kraska, Dan Muresan, Sorin Nasoi, and others. 2009. XQuery reloaded. Proceedings of the VLDB Endowment 2, 2
(2009), 1342–1353.
Zhifeng Bao, Tok Wang Ling, Bo Chen, and Jiaheng Lu. 2009. Effective xml keyword search with relevance oriented ranking.
In Proceedings of the 2009 IEEE 25th International Conference on Data Engineering. IEEE, 517–528.
Radim Bača and Michal Krátký. 2008. On the efficient search of an XML twig query in large dataguide trees. In Proceedings
of the 12th International Database Engineering & Applications Symposium, IDEAS 2008. ACM Press.
Radim Bača and Michal Krátký. 2009. TJDewey – On the efficient path labeling scheme holistic approach. In Proceedings of
the 1st International Workshop on Benchmarking of XML and Semantic Web Applications, DASFAA. Springer–Verlag.
Radim Bača, Jiří Walder, Martin Pawlas, and Michal Krátký. 2010. Benchmarking the compression of XML node streams.
In Proceedings of the BenchmarX 2010 International Workshop, DASFAA. Springer-Verlag.
Rudolf Bayer and E. McCreight. 2002. Software pioneers. Springer-Verlag New York. Chapter Organization and Mainte-
nance of Large Ordered Indexes, 245–262.
Dave Beckett and Brian McBride. 2004. RDF/XML Syntax Specification (Revised). W3C. Retrieved from http://www.w3.org/
TR/rdf-syntax-grammar/.
Nicole Bidoit, Dario Colazzo, Noor Malla, Federico Ulliana, Maurizio Nolé, and Carlo Sartiani. 2013. Processing XML
queries and updates on map/reduce clusters. In Joint 2013 EDBT/ICDT Conferences, EDBT’13 Proceedings. Genoa, 745–
748. DOI:http://dx.doi.org/10.1145/2452376.2452470
Nikos Bikakis, Nektarios Gioldasis, Chrisa Tsinaraki, and Stavros Christodoulakis. 2009. Querying XML data with SPARQL.
In International Conference on Database and Expert Systems Applications. Springer, 372–381.
P. V. Biron and A. Malhotra. October 2004. XML Schema Part 2: Datatypes (Second Edition). W3C. Retrieved from
http://www.w3.org/TR/xmlschema-2/.
Christian Bizer, Tom Heath, and Tim Berners-Lee. 2009. Linked data – The story so far. International Journal on Semantic
Web and Information Systems 5, 3 (2009), 1–22.
S. Boag, D. Chamberlin, M. F. Fern’andez, D. Florescu, J. Robie, and J. Sim’eon. December 2010. XQuery 1.0: An XML Query
Language (Second Edition). W3C. Retrieved from http://www.w3.org/TR/xquery/.
Philip Bohannon, Juliana Freire, Prasan Roy, and Jérôme Siméon. 2002. From XML schema to relations: A cost-based ap-
proach to XML storage. In Proceedings of the 18th International Conference on Data Engineering (ICDE). 64–75.

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:36 R. Bača et al.

Matthias Brantner, Sven Helmer, Carl-Christian Kanne, and Guido Moerkotte. 2005. Full-fledged algebraic Xpath processing
in Natix. In Proceedings of the 21st International Conference on Data Engineering (ICDE). 705–716.
Tim Bray. 2014. The Javascript object notation (Json) data interchange format. Internet Engineering Task Force (IETF) (2014).
T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau. November 2008. Extensible Markup Language (XML)
1.0 (Fifth Edition). W3C. http://www.w3.org/TR/xml/.
Nicolas Bruno, Nick Koudas, and Divesh Srivastava. 2002. Holistic twig joins: Optimal XML pattern matching. In Proceedings
of the ACM SIGMOD International Conference on Management of Data. 310–321.
Jesús Camacho-Rodríguez, Dario Colazzo, and Ioana Manolescu. 2012. Building large XML stores in the Amazon cloud. In
2012 IEEE 28th International Conference on Data Engineering Workshops (ICDEW). IEEE, 151–158.
Jesús Camacho-Rodríguez, Dario Colazzo, and Ioana Manolescu. 2015. Paxquery: Efficient parallel processing of complex
xquery. IEEE Transactions on Knowledge and Data Engineering 27, 7 (2015), 1977–1991.
Dirceu Cavendish and K. Selçuk Candan. 2008. Distributed XML processing: Theory and applications. Journal of Parallel
and Distributed Computing 68, 8 (2008), 1054–1069.
Surajit Chaudhuri. 1998. An overview of query optimization in relational systems. In Proceedings of the 17th ACM SIGACT-
SIGMOD-SIGART Symposium on Principles of Database Systems. ACM, 34–43.
Dunren Che, Karl Aberer, and M. Tamer Özsu. 2006. Query optimization in XML structured-document databases. The VLDB
Journal 15, 3 (2006), 263–289.
Dunren Che, Tok Wang Ling, and Wen-Chi Hou. 2012. Holistic boolean-twig pattern matching for efficient XML query
processing. IEEE Transactions on Knowledge and Data Engineering 24, 11 (2012), 2008–2024.
Bo Chen, Tok Wang Ling, M. Tamer Özsu, and Zhenzhou Zhu. 2007. On label stream partition for efficient holistic twig
join. In Database Systems for Advanced Applications, 10th International Conference (DASFAA). 807–818.
Qun Chen, Andrew Lim, and Kian Win Ong. 2003. D(k)-index: An adaptive structural summary for graph-structured data.
In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD’03). ACM, New York,
134–144.
Songting Chen, Hua-Gang Li, Jun’ichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal, and K. Selçuk Candan. 2006.
Twig2 stack: Bottom-up processing of generalized-tree-pattern queries over XML documents. In Proceedings of the 32nd
International Conference on Very Large Data Bases (VLDB). VLDB endowment, 283–294.
Ting Chen, Jiaheng Lu, and Tok Wang Ling. 2005. On boosting holism in XML twig pattern matching using structural in-
dexing techniques. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (SIGMOD).
455–466.
Zhimin Chen, H. V. Jagadish, Laks V. S. Lakshmanan, and Stelios Paparizos. 2003. From tree patterns to generalized tree
patterns: On efficient evaluation of XQuery. In Proceedings of the 29th International Conference on Very Large Data Bases
(VLDB). 237–248.
B. Choi, M. Mahoui, and D. Wood. 2003. On the optimality of holistic algorithms for twig queries. In Database and Expert
Systems Applications (Lecture Notes in Computer Science), Vol. 2736. Springer–Verlag, 28–37.
J. Clark and S. DeRose. November 1999. XML Path Language (XPath) Version 1.0. W3C. Retrieved from http://www.w3.org/
TR/xpath.
Sara Cohen and Maayan Shiloach. 2009. Flexible XML querying using skyline semantics. In 2009 IEEE 25th International
Conference on Data Engineering. IEEE, 553–564.
Wojciech Czerwinski, Wim Martens, Pawel Parys, and Marcin Przybylko. 2015. The (almost) complete guide to tree pat-
tern containment. In Proceedings of the 34th ACM Symposium on Principles of Database Systems, PODS 2015, Melbourne,
Victoria, Australia, May 31-June 4, 2015. 117–130. DOI:http://dx.doi.org/10.1145/2745754.2745766
Alin Deutsch, Yannis Papakonstantinou, and Yu Xu. 2004. The NEXT logical framework for XQuery. In Proceedings of the
30th International Conference on Very Large Data Bases (VLDB). 168–179.
Paul F. Dietz. 1982. Maintaining order in a linked list. In Proceedings of 14th Annual ACM Symposium on Theory of Computing
(STOC 1982). 122–127.
Matthias Droop, Markus Flarer, Jinghua Groppe, Sven Groppe, Volker Linnemann, Jakob Pinggera, Florian Santner, Michael
Schier, Felix Schöpf, Hannes Staffler, and others. 2008. Bringing the XML and semantic web worlds closer: Transform-
ing XML into RDF and embedding XPath into SPARQL. In International Conference on Enterprise Information Systems.
Springer, 31–45.
Matthias Droop, Markus Flarer, Jinghua Groppe, Sven Groppe, Volker Linnemann, Jakob Pinggera, Florian Santner, Michael
Schier, Felix Schöpf, Hannes Staffler, and others. 2007. Translating xpath queries into sparql queries. In OTM Confeder-
ated International Conferences on the Move to Meaningful Internet Systems. Springer, 9–10.
F. Du, S. Amer-Yahia, and J. Freire. 2004. ShreX: Managing XML documents in relational databases. In VLDB’04: Proceedings
of 30th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc., Toronto, ON, Canada,
1297–1300.

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:37

E.-S. M. El-Alfy, S. Mohammed, and A. F. Barradah. 2015. XHQE: A hybrid system for scalable selectivity estimation of
XML queries. Information Systems Frontiers (2015), 1–17.
Iman Elghandour, Ashraf Aboulnaga, Daniel C. Zilio, and Calisto Zuzarte. 2013. Recommending XML physical designs for
XML databases. VLDB Journal 22, 4 (2013), 447–470. DOI:http://dx.doi.org/10.1007/s00778-012-0298-2
David C. Faye, Olivier Curé, and Guillaume Blin. 2012. A survey of RDF storage approaches. Arima Journal 15 (2012), 11–35.
Donald Feinberg and Adam M. Ronthal. 2016. Hype Cycle for Information Infrastructure, 2016. Technical Report G00304182.
Gartner.
Thorsten Fiebig, Sven Helmer, Carl-Christian Kanne, Guido Moerkotte, Julia Neumann, Robert Schiele, and Till Westmann.
2002. Anatomy of a native XML base management system. VLDB Journal 11, 4 (2002), 292–314.
Guilherme Figueiredo, Vanessa Braganholo, and Marta Mattoso. 2010. Processing queries over distributed XML databases.
Journal of Information and Data Management 1, 3 (2010), 455.
Jan Finis, Robert Brunel, Alfons Kemper, Thomas Neumann, Franz Färber, and Norman May. 2013. DeltaNI: An efficient
labeling scheme for versioned hierarchical data. In Proceedings of the 2013 ACM SIGMOD International Conference on
Management of Data. ACM, 905–916.
Jan Finis, Robert Brunel, Alfons Kemper, Thomas Neumann, Norman May, and Franz Faerber. 2016. Order indexes:
Supporting highly dynamic hierarchical data in relational main-memory database systems. VLDB Journal (2016),
1–26.
Damien K. Fisher and Sebastian Maneth. 2007. Structural selectivity estimation for XML documents. In Proceedings of the
23rd International Conference on Data Engineering (ICDE). 626–635.
D. Florescu and D. Kossmann. 1999. Storing and querying XML data using an RDMBS. IEEE Data Engineering Bulletin 22,
3 (1999), 27–34.
Marcus Fontoura, Vanja Josifovski, Eugene Shekita, and Beverly Yang. 2005. Optimizing cursor movement in holistic twig
joins. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM’05).
ACM, New York, 784–791.
Juliana Freire, Jayant R. Haritsa, Maya Ramanath, Prasan Roy, and Jérôme Siméon. 2002. StatiX: Making XML count. In
Proceedings of the ACM SIGMOD International Conference on Management of Data. 181–191.
Haris Georgiadis, Minas Charalambides, and Vasilis Vassalos. 2010. Efficient physical operators for cost-based XPath exe-
cution. In Proceedings of the 13th International Conference on Extending Database Technology. ACM, 171–182.
Roy Goldman and Jennifer Widom. 1997. DataGuides: Enabling query formulation and optimization in semistructured
databases. In Proceedings of the International Conference on Very Large Data Bases, VLDB 1997. 436–445.
Nils Grimsmo, Truls A. Bjorklund, and Magnus Lie Hetland. 2010. Fast optimal twig joins. In Proceedings of the 36st Inter-
national Conference on Very Large Data Bases, VLDB 2010. VLDB Endowment.
Torsten Grust. 2005. Purely relational FLWORs. In Proceedings of the 2nd International Workshop on XQuery Implementation,
Experience and Perspectives (XIME-P), in cooperation with ACM SIGMOD.
Torsten Grust. June 4-6, 2002. Accelerating XPath location steps. In Proceedings of the 2002 ACM SIGMOD, Madison, USA.
ACM Press.
Torsten Grust, Jan Rittinger, and Jens Teubner. 2008. Pathfinder: XQuery off the relational shelf. IEEE Data Engineering
Bulletin 31, 4 (2008), 7–14.
Torsten Grust, Sherif Sakr, and Jens Teubner. 2004. XQuery on SQL hosts. In Proceedings of the 29th International Conference
on Very Large Data Bases (VLDB). 252–263.
Torsten Grust, Maurice van Keulen, and Jens Teubner. 2003. Staircase join: Teach a relational DBMS to watch its (axis)
steps. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB). 524–525.
Alan Halverson, Josef Burger, Leonidas Galanis, Ameet Kini, Rajasekar Krishnamurthy, Ajith Nagaraja Rao, Feng Tian,
Stratis D. Viglas, Yuan Wang, Jeffrey F. Naughton, and others. 2003. Mixed mode XML query processing. In Proceedings
of the 29th International Conference on Very Large Data Bases-Volume 29. VLDB Endowment, 225–236.
Mohammad Hammoud, Dania Abed Rabbou, Reza Nouri, Seyed-Mehdi-Reza Beheshti, and Sherif Sakr. 2015. DREAM:
Distributed RDF engine with adaptive query planner and minimal communication. PVLDB 8, 6 (2015), 654–665.
T. Härder, M. Haustein, C. Mathis, and M. Wagner. 2007. Node labeling schemes for dynamic XML documents reconsidered.
Data & Knowledge Engineering 60, 1 (2007), 126–149.
IBM. 2016. DB2 Database Software. Retrieved from http://www-01.ibm.com/software/data/db2/.
ISO/IEC 9075-14:2003. 2006. Part 14: XML-Related Specifications (SQL/XML). Int. Organization for Standardization.
Sayyed Kamyar Izadi, Mostafa S. Haghjoo, and Theo Härder. 2012. S3: Processing tree-pattern XML queries with all logical
operators. Data & Knowledge Engineering 72 (2012), 31–62.
H. V. Jagadish, Shurug Al-Khalifa, Adriane Chapman, Laks V. S. Lakshmanan, Andrew Nierman, Stelios Paparizos, Jignesh
M. Patel, Divesh Srivastava, Nuwee Wiwatwattana, Yuqing Wu, and Cong Yu. 2002. TIMBER: A native XML database.
VLDB Journal 11, 4 (2002), 274–291.

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:38 R. Bača et al.

H. V. Jagadish, Laks V. S. Lakshmanan, Divesh Srivastava, and Keith Thompson. 2001. TAX: A tree algebra for XML. In
Proceedings of the 8th International Workshop on Database Programming Languages (DBPL). 149–164.
Haifeng Jiang, Hongjun Lu, and Wei Wang. 2004. Efficient processing of twig queries with OR-predicates. In Proceedings
of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD). 59–70.
Haifeng Jiang, Hongjun Lu, Wei Wang, and Beng Chin Ooi. 2003a. XR-tree: Indexing XML data for efficient structural joins.
In Proceedings of the 19th International Conference on Data Engineering (ICDE). 253–263.
Haifeng Jiang, Wei Wang, Hongjun Lu, and Jeffrey Xu Yu. 2003b. Holistic twig joins on indexed XML documents. In Pro-
ceedings of 29th International Conference on Very Large Data Bases (VLDB). 273–284.
Enhua Jiao, Tok Wang Ling, and Chee Yong Chan. 2005. PathStack-: A holistic path join algorithm for path query with
not-predicates on XML data. In Proceedings of the 10th International Conference on Database Systems for Advanced Ap-
plications (DASFAA). 113–124.
R. Kaushik, P. Bohannon, J. F. Naughton, and H. F. Korth. 2002a. Covering indexes for branching path queries. In Proceedings
of ACM SIGMOD 2002. ACM Press, 133–144.
R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes. 2002b. Exploiting local similarity for indexing paths in graph-structured
data. In Proceedings of the 18th International Conference on Data Engineering, 2001. 129–140. DOI:http://dx.doi.org/10.
1109/ICDE.2002.994703
Lukas Kircher, Michael Grossniklaus, Christian Grün, and Marc H. Scholl. 2015. Efficient structural bulk updates on the
pre/dist/size XML encoding. In 2015 IEEE 31st International Conference on Data Engineering. IEEE, 447–458.
M. Klettke and H. Meyer. 2000. XML and object-relational database systems – Enhancing structural mappings based on
statistics. In Lecture Notes in Computer Science, Vol. 1997. 151–170.
Michal Krátký, Jaroslav Pokorný, and Václav Snášel. 2004. Implementation of XPath axes in the multi-dimensional approach
to indexing XML data. In Current Trends in Database Technology, International Workshop on Database Technologies for
Handling XML information on the Web, DataX, EDBT 2004 (Lecture Notes in Computer Science), Vol. 3268. Springer–Verlag.
A. Kuckelberg and R. Krieger. 2003. Efficient structure oriented storage of XML documents using ORDBMS. In Proceedings of
the VLDB’02 Workshop EEXTT and CAiSE’02 Workshop DTWeb on Efficiency and Effectiveness of XML Tools and Techniques
and Data Integration over the Web – Revised Papers. Springer-Verlag, London, UK, 131–143.
Wolfgang Lehner and Kai-Uwe Sattler. 2013. Web-Scale Data Management for the Cloud. Springer. Retrieved from
http://www.springer.com/computer/database+management+%26+information+retrieval/book/978-1-4614-6855-4.
Viktor Leis, Alfons Kemper, and Thomas Neumann. 2013. The adaptive radix tree: ARTful indexing for main-memory
databases. In 2013 IEEE 29th International Conference on Data Engineering (ICDE) . IEEE, 38–49.
Guoliang Li, Jianhua Feng, Jianyong Wang, and Lizhu Zhou. 2007. Effective keyword search for valuable lcas over xml
documents. In Proceedings of the 16th ACM Conference on Information and Knowledge Management. ACM, 31–40.
Hanyu Li, Mong-Li Lee, Wynne Hsu, and Gao Cong. 2006. An estimation system for XPath expressions. In Proceedings of
the 22nd International Conference on Data Engineering (ICDE). 54.
Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeffrey Scott Vitter, and Ronald Parr. 2002. XPathLearner: An on-line self-
tuning Markov histogram for XML path selectivity estimation. In Proceedings of International Conference on Very Large
Data Bases (VLDB). 442–453.
Rung-Ren Lin, Ya-Hui Chang, and Kun-Mao Chao. 2013. A compact and efficient labeling scheme for XML documents. In
International Conference on Database Systems for Advanced Applications. Springer, 269–283.
Jian Liu, Z. M. Ma, and Li Yan. 2013. Efficient labeling scheme for dynamic XML trees. Information Sciences 221 (2013),
338–354.
Jian Liu and DL Yan. 2016. Answering approximate queries over XML data. IEEE Transactions on Fuzzy Systems 24, 2 (2016),
288–305.
Jian Liu and X. X. Zhang. 2016. Dynamic labeling scheme for XML updates. Knowledge-Based Systems (2016).
Ziyang Liu and Yi Chen. 2011. Processing keyword search on XML: A survey. World Wide Web 14, 5–6 (2011), 671–707.
Jiaheng Lu, Ting Chen, and Tok Wang Ling. 2004. Efficient processing of XML twig patterns with parent child edges: A look-
ahead approach. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management
(CIKM). 533–542.
Jiaheng Lu and Irena Holubová. 2017. Multi-model data management: What’s new and what’s next?. In Proceedings of
the 20th International Conference on Extending Database Technology, EDBT 2017, Venice, Italy, March 21–24, 2017. Volker
Markl, Salvatore Orlando, Bernhard Mitschang, Periklis Andritsos, Kai-Uwe Sattler, and Sebastian Breß (Eds.). Open-
Proceedings.org, 602–605. DOI:http://dx.doi.org/10.5441/002/edbt.2017.80
J. Lu, T. W. Ling, C. Y. Chan, and T. Chen. 2005a. From region encoding to extended Dewey: On efficient processing of XML
twig pattern matching. In Proceedings of the 31st Conference on Very Large Data Bases, VLDB 2005. VLDB Endowment,
193–204.
Jiaheng Lu, Tok Wang Ling, Zhifeng Bao, and Chen Wang. 2010. Extended XML tree pattern matching: Theories and
algorithms. IEEE Transactions on Knowledge and Data Engineering (TKDE) 2010 23 (2010), 402–416. Issue 3.

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:39

Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, and Wei Ni. 2005b. Efficient processing of ordered XML twig pattern.
In Proceedings of DEXA 2005 (Lecture Notes in Computer Science), Vol. 3588. Springer-Verlag, 300–309.
Cheng Luo, Zhewei Jiang, Wen-Chi Hou, Feng Yu, and Qiang Zhu. 2009. A sampling approach for XML query selectivity
estimation. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database
Technology. ACM, 335–344.
Imam Machdi, Toshiyuki Amagasa, and Hiroyuki Kitagawa. 2009. XML data partitioning strategies to improve parallelism
in parallel holistic twig joins. In Proceedings of the 3rd International Conference on Ubiquitous Information Management
and Communication. ACM, 471–480.
Imam Machdi, Toshiyuki Amagasa, and Hiroyuki Kitagawa. 2010. Parallel holistic twig joins on a multi-core system. Inter-
national Journal of Web Information Systems 6, 2 (2010), 149–177.
Sebastian Maneth, Nikolay Mihaylov, and Sherif Sakr. 2008. XML tree structure compression. In Proceedings of the 19th
International Workshop on Database and Expert Systems Applications (DEXA’08). 243–247. DOI:http://dx.doi.org/10.1109/
DEXA.2008.41
Ioana Manolescu, Yannis Papakonstantinou, and Vasilis Vassalos. 2009. XML tuple algebra. In Encyclopedia of Database
Systems. Springer, 3640–3646.
Christian Mathis. 2007. Extending a tuple-based XPath algebra to enhance evaluation flexibility. Informatik-Forschung und
Entwicklung 21, 3–4 (2007), 147–164.
Norman May, Sven Helmer, and Guido Moerkotte. 2004. Nested queries and quantifiers in an ordered context. In 2013 IEEE
29th International Conference on Data Engineering (ICDE). IEEE Computer Society, 239–239.
Philippe Michiels, George A. Mihaila, and Jérôme Siméon. 2007. Put a tree pattern in your algebra. In Proceedings of the
23rd International Conference on Data Engineering (ICDE). 246–255.
Microsoft. 2016. Microsoft SQL Server 2016. Retrieved from http://www.microsoft.com/en-us/server-cloud/products/
sql-server/.
Salahadin Mohammed, Ahmad F. Barradah, and El-Sayed M. El-Alfy. 2016. Selectivity estimation of extended XML query
tree patterns based on prime number labeling and synopsis modeling. Simulation Modelling Practice and Theory 64
(2016), 30–42.
Salahadin Mohammed, El-Sayed M. El-Alfy, and Ahmad F. Barradah. 2015. Improved selectivity estimator for XML queries
based on structural synopsis. World Wide Web 18, 4 (2015), 1123–1144.
MonetDB BV. 2008. MonetDB/XQuery. MonetDB B.V.http://www.monetdb.org/XQuery.
Matthias Nicola, Irina Kogan, and Berni Schiefer. 2007. An XML transaction processing benchmark. In Proceedings of the
2007 ACM SIGMOD International Conference on Management of Data. ACM, 937–948.
Bo Ning, Guoren Wang, and Jeffrey Xu Yu. 2008. A holistic algorithm for efficiently evaluating Xtwig joins. In Proceedings
of the 13th International Conference on Database Systems for Advanced Applications (DASFAA 2008) (Lecture Notes in
Computer Science), Vol. 4947. Springer–Verlag, 571–579.
P. O’Neil, E. O’Neil, S. Pal, I. Cseri, G. Schaller, and N. Westbury. 2004. ORDPATHs: Insert-friendly XML node labels. In
Proceedings of the 2004 ACM International Conference on Management of Data, SIGMOD 2004. 903–908.
Oracle. 2016. Oracle Database 12c. (2016). https://www.oracle.com/database/.
Fatma Özcan, Normen Seemann, and Ling Wang. 2008. XQuery rewrite optimization in IBM DB2 pureXML. IEEE Data
Engineering Bulletin 31, 4 (2008), 25–32.
Yinfei Pan, Ying Zhang, and Kenneth Chiu. 2008. Parsing XML using parallel traversal of streaming trees. In International
Conference on High-Performance Computing. Springer, 142–156.
Neoklis Polyzotis and Minos Garofalakis. 2006a. XSKETCH synopses for XML data graphs. ACM Transactions on Database
Systems 31 (September2006), 1014–1063. Issue 3.
Neoklis Polyzotis, Minos Garofalakis, and Yannis Ioannidis. 2004b. Approximate XML query answers. In Proceedings of the
2004 ACM SIGMOD International Conference on Management of Data. ACM, 263–274.
Neoklis Polyzotis, Minos Garofalakis, and Yannis Ioannidis. 2004a. Selectivity estimation for XML twigs. In Proceedings of
the 20th International Conference on Data Engineering, 2004. IEEE, 264–275.
Neoklis Polyzotis and Minos N. Garofalakis. 2002. Structure and value synopses for XML data graphs. In Proceedings of the
International Conference on Very Large Data Bases (VLDB). IEEE, 466–477.
Neoklis Polyzotis and Minos N. Garofalakis. 2006b. XCluster synopses for structured XML content. In Proceedings of the
International Conference on Data Engineering (ICDE). IEEE, 63.
Viswanath Poosala and Yannis E. Ioannidis. 1997. Selectivity estimation without the attribute value independence sssump-
tion. VLDB Journal Vol. 97. 486–495.
Lu Qin, Jeffrey Xu Yu, and Bolin Ding. 2007. TwigList: Make twig pattern matching fast. In Proceedings of the 10th Interna-
tional Conference on Database Systems for Advanced Applications (DASFAA). 850–862.
Christopher Re, Jérôme Siméon, and Mary F. Fernández. 2006. A complete and efficient algebraic compiler for XQuery. In
Proceedings of the 22nd International Conference on Data Engineering (ICDE). 14.

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
64:40 R. Bača et al.

Jonathan Robie, Matthias Brantner, Daniela Florescu, Ghislain Fourny, and Till Westmann. 2012. JSONiq: XQuery for JSON.
JSON for XQuery (2012), 63–72.
Kanda Runapongsa and Jignesh M. Patel. 2002. Storing and querying XML data in object-relational DBMSs. In Proceedings of
the Worshops XMLDM, MDDE, and YRWS on XML-Based Data Management and Multimedia Engineering-Revised Papers
(EDBT’02). Springer-Verlag, London, 266–285.
Sherif Sakr. 2007. Cardinality-aware and purely relational implementation of an XQuery processor. Ph.D. Dissertation. Uni-
versity of Konstanz. http://www.ub.uni-konstanz.de/kops/volltexte/2007/3259/.
Sherif Sakr. 2008. Algebra-based XQuery cardinality estimation. International Journal of Web Information Systems 4, 1 (2008),
7–46.
Sherif Sakr. 2009. XML compression techniques: A survey and comparison. Journal of Computer System Science 75, 5 (2009),
303–322.
Sherif Sakr. 2016. Big Data 2.0 Processing Systems - A Survey. Springer. DOI:http://dx.doi.org/10.1007/978-3-319-38776-5
Sherif Sakr and Ghazi Al-Naymat. 2010. Relational processing of RDF queries: A survey. ACM SIGMOD Record 38, 4 (2010),
23–28.
Sherif Sakr and Mohamed Medhat Gaber (Eds.). 2014. Large Scale and Big Data - Processing and Management. Auerbach
Publications. DOI:http://dx.doi.org/10.1201/b17112
Sherif Sakr, Anna Liu, Daniel M. Batista, and Mohammad Alomari. 2011. A survey of large scale data management ap-
proaches in cloud environments. IEEE Communications Surveys and Tutorials 13, 3 (2011), 311–336. DOI:http://dx.doi.
org/10.1109/SURV.2011.032211.00087
Sherif Sakr and Eric Pardede (Eds.). 2011. Graph Data Management: Techniques and Applications. IGI Global. DOI:
http://dx.doi.org/10.4018/978-1-61350-053-8
Airi Salminen and Frank Wm. Tompa. 1994. PAT expressions: An algebra for text search. Acta Linguistica Hungarica 41, 1
(1994), 277–306.
Stefan Schuh, Xiao Chen, and Jens Dittrich. 2016. An experimental comparison of thirteen relational equi-joins in main
memory. In Proceedings of the 2016 International Conference on Management of Data. ACM, 1961–1976.
Thomas Schwentick. 2007. Automata for XML: A survey. Journal of Computer System Science 73, 3 (2007), 289–315.
J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. J. DeWitt, and J. F. Naughton. 1999. Relational databases for querying
XML documents: Limitations and opportunities. In VLDB’99: Proceedings of the 25th International Conference on Very
Large Data Bases. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 302–314.
Lila Shnaiderman and Oded Shmueli. 2015. Multi-core processing of XML twig patterns. IEEE Transactions on Knowledge
and Data Engineering 27, 4 (2015), 1057–1070.
Adam Silberstein, Hao He, Ke Yi, and Jun Yang. 2005. BOXes: Efficient maintenance of order-based labeling for dynamic
XML data. In Proceedings of the 21st International Conference on Data Engineering (ICDE). IEEE, 285–296.
Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. 2002. Storing
and querying ordered XML using a relational database system. In Proceedings of the ACM International Conference on
Management of Data (SIGMOD’02). ACM Press, New York.204–215.
Jens Teubner, Torsten Grust, Sebastian Maneth, and Sherif Sakr. 2008. Dependable cardinality forecasts for XQuery. Pro-
ceedings of the VLDB Endowment (PVLDB) 1, 1 (2008), 463–477.
H. S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. October 2004. XML Schema Part 1: Structures (Second Edition).
W3C. http://www.w3.org/TR/xmlschema-1/.
Pingfang Tian, Dan Luo, Yaoyao Li, and Jinguang Gu. 2013. XML multi-core query optimization based on task preemption
and data partition. In Joint International Semantic Technology Conference. Springer, 294–305.
Silke Trißl and Ulf Leser. 2007. Fast and practical indexing and querying of very large graphs. In Proceedings of the 2007
ACM SIGMOD International Conference on Management of Data. ACM, 845–856.
University of Washington Database Group. 2002. The XML Repository. Retrieved from http://www.cs.washington.edu/
research/xmldatasets/.
Henrique Valer, Caetano Sauer, and Theo Härder. 2013. XQuery processing over NoSQL stores. In Grundlagen von Daten-
banken. Citeseer, 75–80.
Andreas M. Weiner and Theo Härder. 2009. Using structural joins and holistic twig joins for native XML query optimization.
In Advances in Databases and Information Systems (LNCS), Vol. 5739. Springer, Berlin, 149–163.
Andreas M. Weiner and Theo Härder. 2010. An integrative approach to query optimization in native XML database man-
agement systems. In Proceedings of the 14th International Database Engineering & Applications Symposium (IDEAS’10).
ACM, New York, 64–74.
Y. Wu, J. M. Patel, and H. Jagadish. 2003. Structural join order selection for XML query optimization. In Proceedings of ICDE
2003. IEEE CS, 443–454.
Yuqing Wu, Jignesh M. Patel, and H. V. Jagadish. 2002. Estimating answer sizes for XML queries. In Proceedings of the 8th
International Conference on Extending Database Technology (EDBT). 590–608.

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.
Structural XML Query Processing 64:41

Yanghua Xiao, Ji Hong, Wanyun Cui, Zhenying He, Wei Wang, and Guodong Feng. 2012. Branch code: A labeling scheme
for efficient query answering on trees. In 2012 IEEE 28th International Conference on Data Engineering. IEEE, 654–665.
W. Xiao-ling, L. Jin-feng, and D. Yi-sheng. 2003. An adaptable and adjustable mapping from XML data to tables in RDB. In
Proceedings of the VLDB’02 Workshop EEXTT and CAiSE’02 Workshop DTWeb. Springer-Verlag, London. 117–130.
Liang Xu, Tok Wang Ling, Huayu Wu, and Zhifeng Bao. 2009. DDE: From Dewey to a fully dynamic XML labeling scheme.
In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data. ACM, 719–730.
B. Yang, M. Fontoura, E. Shekita, S. Rajagopalan, and K. Beyer. 2004. Virtual cursors for XML joins. In Proceedings of the
13th ACM International Conference on Information and Knowledge Management. ACM, 523–532.
Masatoshi Yoshikawa, T. Amagasa, T. Shimura, and S. Uemura. 2001. XRel: A path-based approach to storage and retrieval
of XML documents using relational databases. ACM Transactions on Internet Technology 1, 1 (2001), 110–141.
Tian Yu, Tok Wang Ling, and Jiaheng Lu. 2006. TwigStackList-: A holistic twig join algorithm for twig query with not-
predicates on XML data. In Proceedings of the 11th International Conference on Database Systems for Advanced Applica-
tions (DASFAA). 249–263.
Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong Luo, and Guy Lohman. 2001. On supporting containment queries in
relational database management systems. In Proceedings of the 2001 ACM SIGMOD International Conference on Manage-
ment of Data. ACM Press, 425–436.
Ning Zhang, Peter J. Haas, Vanja Josifovski, Guy M. Lohman, and Chun Zhang. 2005. Statistical learning techniques for
costing XML queries. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB). 289–300.
Ning Zhang, M. Tamer Özsu, Ashraf Aboulnaga, and Ihab F. Ilyas. 2006. XSEED: Accurate and fast cardinality estimation
for XPath queries. In Proceedings of the 22nd International Conference on Data Engineering (ICDE). IEEE, 61.
Junfeng Zhou, Wei Wang, Ziyang Chen, Jeffrey Xu Yu, Xian Tang, Yifei Lu, and Yukun Li. 2016. Top-down XML keyword
query processing. IEEE Transactions on Knowledge and Data Engineering 28, 5 (2016), 1340–1353.

Received May 2016; revised April 2017; accepted May 2017

ACM Computing Surveys, Vol. 50, No. 5, Article 64. Publication date: September 2017.

You might also like